Ryan Marcus, assistant professor at the University of Pennsylvania. Using machine learning to build the next generation of data systems.
2024's hottest topics in databases (a bibliometric approach)

From database companies to renowned professor of databaseology Andy Pavlo, everyone seems to be writing their “year in review” of the database world. While I could never match Andy’s wit nor his no-doubt earnest adoration for Larry Ellison, I thought I might be able to provide some additional insights into the hottest topics in database research from a bibliometric perspective.

For the past few years, I’ve maintained a “ranking” of database researchers and papers using the citation graph and PageRank. We can use the same data to identify the papers with the fastest-growing citation counts, arguably a reasonable proxy for “hotness.” Note that these are the papers being cited a lot: they may represent the baseline for a particular area, or the first paper in a line of work, rather than the “hot” follow-on work itself.

Below, I’ve organized, by topic, the 20 papers with the fastest-growing non-self-citation counts.

Learned indexes

Rank  Title                                        Year  Non-self Citations  Per Year
   1  The Case for Learned Index Structures.       2018                 142        20
  12  ALEX: An Updatable Adaptive Learned Index.   2020                  59        11

Since Tim Kraska’s paper on learned indexes (written during a sabbatical at Google, just before his move from Brown to MIT), the database community has been hard at work developing new learned indexes, benchmarking them, and integrating them into real systems. A recent survey paper from Purdue shows just how explosive the subject has become:

[Figure: a genealogy tree of learned indexes]

The ALEX paper was (to the best of my knowledge) the first paper to make learned indexes updatable, leveraging an adaptive tree structure. The first author, Jialin Ding, was a PhD student in Kraska’s lab, and is now a researcher at Amazon. Jialin is also the lead author on a paper describing how multi-dimensional learned indexes were integrated into AWS Redshift.

It’s been a while since Kraska’s paper made its initial splash, but (for a SIGMOD paper) it was surprisingly controversial. Blog posts about whether we should throw out our algorithms textbooks or simply stick with B-trees were everywhere. In my opinion, part of this confusion arose because the original SIGMOD paper did not come with any code. And, if you don’t believe in an idea, it’s very easy to whip up a crappy implementation. Fortunately, since 2018, we’ve developed a much better understanding of why learned indexes work.
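The core idea is simple enough to sketch in a few lines: an index over sorted keys is just a function from keys to positions, so we can learn an approximation of that function and correct its error with a bounded search. Here’s a minimal read-only sketch of my own (a toy, not code from any of the papers above) using a single linear model:

import bisect

class LinearLearnedIndex:
    """Toy single-model learned index: a linear fit plus an error bound."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst-case prediction error bounds the correction search.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        # Model prediction, then a binary search within +/- err positions.
        pos = self._predict(key)
        lo, hi = max(0, pos - self.err), min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LinearLearnedIndex(range(0, 200_000, 2))
print(idx.lookup(4242))   # -> 2121
print(idx.lookup(4243))   # -> None (odd keys are absent)

Real learned indexes (the RMI, the PGM-index, ALEX) use hierarchies of models, tighter error bounds, and, in ALEX’s case, gapped arrays to support inserts; this toy only shows the predict-then-correct pattern.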

(Learned) query optimization

Rank  Title                                                                    Year  Non-self Citations  Per Year
   2  Learned Cardinalities: Estimating Correlated Joins with Deep Learning.   2019                 124        20
   3  Neo: A Learned Query Optimizer. [1]                                      2019                 116        19
   4  How Good Are Query Optimizers, Really?                                   2015                 188        18
   7  Bao: Making Learned Query Optimization Practical. [1]                    2021                  57        14
  19  An End-to-End Learning-based Cost Estimator.                             2019                  61        10
  20  Are We Ready For Learned Cardinality Estimation?                         2021                  42        10

Arguably since the 2001 IBM LEO paper, we’ve been trying to figure out how to build query optimizers that learn from their mistakes. After all, if the optimizer chooses a bad query plan, why shouldn’t it pick a better one next time?

The database community has been attacking this problem from (at least) three different directions, each of which is represented in the “hottest” papers:

  1. Using machine learning methods, like deep learning, to replace cardinality estimates,
  2. using semi-supervised or guided learning techniques to replace cost models,
  3. and using reinforcement learning to either replace or steer the query optimization process (sketched below).
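To make that third direction concrete, here’s a toy sketch of “steering” framed as a multi-armed bandit over optimizer hint sets. The hint sets are real PostgreSQL planner settings, but run_with_hints is a hypothetical stand-in for executing a query under a hint set, and the real Bao uses a tree convolution network over plan trees with Thompson sampling, not this epsilon-greedy loop:

import random

# Each "arm" is a set of PostgreSQL planner settings (a hint set).
HINT_SETS = [
    {},                             # let the optimizer do its thing
    {"enable_nestloop": "off"},     # forbid nested-loop joins
    {"enable_hashjoin": "off"},     # forbid hash joins
    {"enable_indexscan": "off"},    # forbid index scans
]

def run_with_hints(query, hints):
    # Hypothetical stand-in: a real system would apply `hints`, execute
    # `query`, and return the measured latency in seconds.
    return random.uniform(1.0, 10.0)

class EpsilonGreedySteerer:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.latencies = {arm: [] for arm in range(len(HINT_SETS))}

    def choose(self):
        untried = [arm for arm, obs in self.latencies.items() if not obs]
        if untried:
            return random.choice(untried)            # try every arm once
        if random.random() < self.epsilon:
            return random.randrange(len(HINT_SETS))  # explore
        # Exploit: the arm with the lowest average latency so far.
        return min(self.latencies,
                   key=lambda a: sum(self.latencies[a]) / len(self.latencies[a]))

    def run(self, query):
        arm = self.choose()
        latency = run_with_hints(query, HINT_SETS[arm])
        self.latencies[arm].append(latency)          # learn from the outcome
        return latency

steerer = EpsilonGreedySteerer()
for _ in range(100):
    steerer.run("SELECT count(*) FROM title t, cast_info ci WHERE t.id = ci.movie_id")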

While academics continue to wax poetic about the merits of each approach, practitioners in industry also made some inroads with learned query optimization, including a deployment of “steered” optimizers at Microsoft and Meta [1], as well as integrations of learned performance predictors into AWS Redshift’s intelligent scaling and workload manager [1].

The classic “How Good Are Query Optimizers, Really?” paper also makes an appearance, as the use of the join order benchmark (JOB) has become ubiquitous in query optimization research. Of course, this is also a seminal paper in query optimization!

DBMS engines

Rank  Title                                                                                        Year  Non-self Citations  Per Year
   5  Spark SQL: Relational Data Processing in Spark.                                              2015                 172        17
   6  Pregel: a system for large-scale graph processing.                                           2010                 214        14
  10  The Snowflake Elastic Data Warehouse.                                                        2016                 102        11
  11  Hekaton: SQL server's memory-optimized OLTP engine.                                          2013                 138        11
  13  Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases.  2017                  94        11
  16  DuckDB: an Embeddable Analytical Database.                                                   2019                  65        10
  18  Efficiently Compiling Efficient Query Plans for Modern Hardware.                             2011                 151        10

The execution model and engines of DBMSes will always be a core part of the database community’s interests. Recently, researchers have started to build on “new” foundations!

Perhaps the most notable, at least from my perspective, is the most recent addition: DuckDB. For a long time, vectorized out-of-core [2] analytics databases were only available as proprietary products (like Vertica) or locked away in TU Munich’s code repositories (like HyPer, which was sold to Tableau, which was in turn acquired by Salesforce). DuckDB not only provides a state-of-the-art implementation of a vectorized database, but also packages it up as an easy-to-use, embedded tool. As a result, DuckDB has become the de facto testbed for implementing new OLAP techniques, like new research on join ordering or confidential computing.
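If you haven’t tried it, the “embedded” part is the magic: there is no server to stand up. A quick taste using DuckDB’s Python package (the toy table and query are mine, not from the paper):

import duckdb  # pip install duckdb

con = duckdb.connect()  # an in-memory database; nothing to configure

# Build a tiny table; DuckDB can also query Parquet/CSV files in place.
con.execute("""
    CREATE TABLE title AS
    SELECT * FROM (VALUES (1, 2021), (2, 2022), (3, 2022)) t(id, production_year)
""")

print(con.execute(
    "SELECT production_year, count(*) FROM title GROUP BY production_year ORDER BY 1"
).fetchall())
# [(2021, 1), (2022, 2)]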

ML-powered system tuning

Rank  Title                                                                                    Year  Non-self Citations  Per Year
   8  Automatic Database Management System Tuning Through Large-scale Machine Learning.        2017                 109        13
  17  An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning.  2019                  60        10

Databases have knobs. Lots and lots of knobs. Whether it’s the classic PostgreSQL shared_buffers or something more obscure like Vertica’s resource pools, getting the knobs set to the “right” values for your workload has always been an essential performance tuning step.

The two hottest papers in this category both represent machine-learning-powered techniques, using either Bayesian optimization or deep reinforcement learning. The approaches building on these two works have made a lot of progress in both cloud and on-premises environments. Of particular note is the DB-BERT line of work from Immanuel Trummer’s lab at Cornell, which integrates LLMs into this already ML-heavy area, resulting in a database tuner that can, almost literally, “read the manual.”
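In that spirit, here’s a toy Bayesian-optimization-style tuning loop for a single knob. The benchmark function is a synthetic stand-in (a real tuner would measure actual workload latency under each configuration), and real systems tune dozens of knobs jointly:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def benchmark(shared_buffers_mb):
    # Synthetic stand-in: pretend latency bottoms out near 4 GB.
    return (shared_buffers_mb - 4096.0) ** 2 / 1e6 + np.random.normal(0, 0.5)

candidates = np.linspace(128, 16384, 200).reshape(-1, 1)  # knob values to try (MB)
X, y = [], []

# Seed the model with a few random configurations.
for x0 in np.random.choice(candidates.ravel(), 3, replace=False):
    X.append([x0]); y.append(benchmark(x0))

for _ in range(15):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    # Lower confidence bound: prefer low predicted latency and high uncertainty.
    nxt = float(candidates[np.argmin(mean - 1.5 * std)][0])
    X.append([nxt]); y.append(benchmark(nxt))

print(f"best shared_buffers found: {X[int(np.argmin(y))][0]:.0f} MB")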

Data cleaning and matching

Rank  Title                                                           Year  Non-self Citations  Per Year
   9  Deep Entity Matching with Pre-Trained Language Models.          2020                  66        13
  14  HoloClean: Holistic Data Repairs with Probabilistic Inference.  2017                  90        11
  15  Deep Learning for Entity Matching: A Design Space Exploration.  2018                  71        10

Database systems are only as useful as the data they contain! Garbage in, garbage out. Tools for data cleaning and entity matching are critical for bringing structure to chaos, enabling analysis and hopefully dredging up some of that good good Actionable Business Insight we love to talk about.

The prevailing trend, as you might imagine, is integrating LLMs and other recent ML inventions into traditional data cleaning and matching tools. While the latency of these systems still leaves something to be desired, the accuracy of these new LLM-enabled tools is often far beyond what we could do before.

Some good examples of this trend include Avanika Narayan’s work in Chris Ré’s lab, which shows that LLMs (“foundation models”) are quite effective at a number of data wrangling tasks. A slightly more involved approach is Nan Tang’s recent paper on the relational pre-trained transformer (which is apparently almost all you need), which is trained directly for data cleaning tasks and can even operate on complex, non-1NF types like JSON (but we always put our databases into 3NF, right?).
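For a taste of the LM-based approach, here’s a minimal sketch of entity matching via embedding similarity. The “COL/VAL” serialization is Ditto-style, but the rest is a simplification of my own: the papers above fine-tune the language model as a match/no-match classifier, which works much better than comparing off-the-shelf embeddings. Assumes the sentence-transformers package and its all-MiniLM-L6-v2 model:

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def serialize(record):
    # Ditto-style serialization: "COL <attr> VAL <value> ..."
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

a = {"title": "iPhone 14 Pro 128GB", "brand": "Apple", "price": "999"}
b = {"title": "Apple iPhone 14 Pro (128 GB)", "brand": "apple", "price": "$999.00"}
c = {"title": "Galaxy S23 Ultra 256GB", "brand": "Samsung", "price": "1199"}

emb = model.encode([serialize(r) for r in (a, b, c)], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high similarity: likely a match
print(util.cos_sim(emb[0], emb[2]).item())  # lower: likely not a match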

What’s next?

Another year goes by and databases get a little faster, a little cleaner, a little easier to use, and a little smarter. Maybe next year the DataFusion team or the DuckDB team will make them a little cheaper! But nobody tell the VCs.

Perhaps you want to hear the no-doubt enlightened opinion of me, your humble author? Well, I think smarter, faster, cheaper, and more accessible databases are great, but how come nobody is looking into making databases more polite? Look what happens if I misspell WHERE:

imdb=# select count(*) from title were production_year=2022;
ERROR:  syntax error at or near "production_year"
LINE 1: select count(*) from title were production_year=2022;

Frankly rude – no respect from PostgreSQL here. I guess if Mike Stonebraker raised me in a basement at Berkeley in the 80s, I might also be missing a few “pleases” and “thank yous” around the edges. Or maybe the middle is already hollow and the papers are all diarrhea, what do I know?

(If you actually want to know what I’ve been up to, check out our latest work on offline query optimization – SIGMOD papers coming soon!)

  1. I am an author of this paper.

  2. MonetDB is arguably an exception to this claim, since MonetDB could mmap larger-than-memory chunks, but I think it is reasonable to categorize MonetDB as a mostly in-memory analytics database.