2024's hottest topics in databases (a bibliometric approach)
From database companies to renowned professor of databaseology Andy Pavlo, everyone seems to be writing their “year in review” of the database world. While I could never match Andy’s wit nor his no-doubt earnest adoration for Larry Ellison, I thought I might be able to provide some additional insights into the hottest topics in database research from a bibliometric perspective.
For the past few years, I’ve maintained a “ranking” of database researchers and papers using the citation graph and PageRank. We can use the same data to identify the papers with the fastest-growing citation counts, which is arguably a reasonable proxy for “hotness.” Note that these are the papers being cited a lot: they might represent the baseline for a particular area, or the first paper in a line of work, and might not be the “hottest” current work themselves.
Below, I’ve organized the 20 papers with the fastest-growing non-self-citation counts by topic.
Learned indexes
Rank | Title | Year | Non-self Citations | Per Year
---|---|---|---|---
1 | The Case for Learned Index Structures. | 2018 | 142 | 20
12 | ALEX: An Updatable Adaptive Learned Index. | 2020 | 59 | 11
Since Tim Kraska’s paper on learned indexes (written during a sabbatical at Google, and preceding his move from Brown to MIT), the database community has been hard at work developing new learned indexes, benchmarking them, and implementing them in real systems. A recent survey paper from Purdue shows just how explosive the subject has become.
The ALEX paper was (to the best of my knowledge) the first paper to make learned indexes updatable, leveraging an adaptive tree structure. The first author, Jialin Ding, was a PhD student in Kraska’s lab, and is now faculty at Princeton. Jialin is also the lead author on a paper describing how multi-dimensional learned indexes were integrated into AWS Redshift.
It’s been a while since Kraska’s paper made its initial splash, but (for a SIGMOD paper) it was surprisingly controversial. Blog posts about whether we should throw out our algorithm textbooks or simply use B-trees were everywhere. In my opinion, part of the confusion stemmed from the fact that the original SIGMOD paper did not come with any code. And if you don’t believe in an idea, it’s very easy to whip up a crappy implementation. Fortunately, since 2018, we now have a much better idea of why learned indexes work.
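To make the core idea concrete, here is a minimal sketch of the learned-index intuition (my own toy code, not the paper’s): fit a model that maps a key to its approximate position in a sorted array, then correct the prediction with a bounded search around the guess. The single linear model and worst-case error bound are a simplification of the recursive model index described in the paper.

```python
import bisect
import numpy as np

class LearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a
    sorted array; a bounded binary search around the prediction corrects it."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=np.float64))
        positions = np.arange(len(self.keys))
        # Least-squares fit of position ~ slope * key + intercept.
        self.slope, self.intercept = np.polyfit(self.keys, positions, 1)
        preds = self.slope * self.keys + self.intercept
        # Worst-case prediction error bounds the local search window.
        self.max_err = int(np.ceil(np.max(np.abs(preds - positions))))

    def lookup(self, key):
        guess = int(self.slope * key + self.intercept)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only within the error window around the guess.
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LearnedIndex(np.random.randint(0, 1_000_000, size=10_000))
print(idx.lookup(idx.keys[1234]))  # position of an existing key
```

The appeal is that a model plus its error bound can be much smaller than a B-tree over the same keys, while the bounded search keeps lookups correct even when the model is wrong.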
(Learned) query optimization
Rank | Title | Year | Non-self Citations | Per Year
---|---|---|---|---
2 | Learned Cardinalities: Estimating Correlated Joins with Deep Learning. | 2019 | 124 | 20
3 | | 2019 | 116 | 19
4 | How Good Are Query Optimizers, Really? | 2015 | 188 | 18
7 | | 2021 | 57 | 14
19 | An End-to-End Learning-based Cost Estimator. | 2019 | 61 | 10
20 | Are We Ready For Learned Cardinality Estimation? | 2021 | 42 | 10
Arguably since the 2001 IBM Leo paper, we’ve been trying to figure out how to build query optimizers that learn from their mistakes. After all, if the optimizer chooses a bad query plan, why shouldn’t it pick a better one next time?
The database community has been attacking this problem from (at least) three different directions, each of which is represented in the “hottest” papers:
- Using machine learning methods, like deep learning, to replace cardinality estimates (sketched just after this list),
- using semi-supervised or guided learning techniques to replace cost models,
- and using reinforcement learning to either replace or steer the query optimization process.
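As a rough illustration of the first direction, here is a toy sketch (mine, not any of the cited papers’ models) of learned cardinality estimation: featurize a conjunctive range query, train a regressor on observed true cardinalities, and use its predictions in place of histogram-based estimates. The feature encoding and the scikit-learn regressor are stand-ins; real systems like MSCN use specialized deep architectures over join and predicate sets.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
table = rng.normal(size=(100_000, 3))          # toy 3-column table

def true_card(lo, hi):
    """Ground-truth cardinality of a conjunctive range predicate."""
    mask = np.all((table >= lo) & (table <= hi), axis=1)
    return mask.sum()

def featurize(lo, hi):
    # Query features: predicate bounds plus per-column range widths.
    return np.concatenate([lo, hi, hi - lo])

# Collect training data from past queries (the "learning from mistakes" part).
X, y = [], []
for _ in range(500):
    lo = rng.uniform(-2, 0, size=3)
    hi = lo + rng.uniform(0, 3, size=3)
    X.append(featurize(lo, hi))
    y.append(np.log1p(true_card(lo, hi)))      # predict log-cardinality

model = GradientBoostingRegressor().fit(np.array(X), np.array(y))

lo, hi = np.array([-1.0, -0.5, 0.0]), np.array([1.0, 1.5, 2.0])
est = np.expm1(model.predict([featurize(lo, hi)])[0])
print(f"estimated rows: {est:.0f}, true rows: {true_card(lo, hi)}")
```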
While academics continue to wax poetic about the merits of each approach, practitioners in industry have also made inroads with learned query optimization, including deployments of “steered” optimizers at Microsoft and Meta, as well as integrations of learned performance predictors into AWS Redshift’s intelligent scaling and workload manager.
The classic “How Good Are Query Optimizers, Really?” paper also makes an appearance, as the use of the join order benchmark (JOB) has become ubiquitous in query optimization research. Of course, this is also a seminal paper in query optimization!
DBMS engines
Rank | Title | Year | Non-self Citations | Per Year
---|---|---|---|---
5 | Spark SQL: Relational Data Processing in Spark. | 2015 | 172 | 17
6 | Pregel: a system for large-scale graph processing. | 2010 | 214 | 14
10 | The Snowflake Elastic Data Warehouse. | 2016 | 102 | 11
11 | Hekaton: SQL server's memory-optimized OLTP engine. | 2013 | 138 | 11
13 | Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. | 2017 | 94 | 11
16 | DuckDB: an Embeddable Analytical Database. | 2019 | 65 | 10
18 | Efficiently Compiling Efficient Query Plans for Modern Hardware. | 2011 | 151 | 10
The execution model and engines of DBMSes will always be a core part of the database community’s interests. Recently, researchers have started to build on “new” foundations!
- Papers building on Spark SQL continue to emphasize ease of use and compatibility with distributed, open source (mostly Apache) runtimes.
- Papers building on Pregel continue to tackle graph database problems, a space that is far from solved – graph databases are interesting because they seem to warrant both a new query language and a new execution model (a toy sketch of the vertex-centric model follows this list).
- Papers building on Hekaton continue to push the boundaries of in-memory data processing, often integrating processing-in-memory (PIM) elements or pushing the performance envelope.
- The database community’s obsession with the cloud continues to show in works building off of Snowflake and Aurora.
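For readers who haven’t seen the Pregel model, here is a minimal single-machine sketch of “think like a vertex” computation, using PageRank over supersteps with message passing. The function and names are my own illustration, not Pregel’s actual API; the real system distributes vertices across workers and adds combiners, aggregators, and fault tolerance.

```python
# Toy vertex-centric PageRank in the Pregel style: in each superstep a vertex
# consumes incoming messages, updates its value, and sends new messages along
# its out-edges.
def pregel_pagerank(out_edges, num_supersteps=20, damping=0.85):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}

    for _ in range(num_supersteps):
        # "Send" phase: each vertex shares its rank equally with its neighbors.
        inbox = {v: [] for v in out_edges}
        for v, neighbors in out_edges.items():
            if neighbors:
                share = rank[v] / len(neighbors)
                for dst in neighbors:
                    inbox[dst].append(share)
        # "Compute" phase: each vertex updates its value from received messages.
        for v in out_edges:
            rank[v] = (1 - damping) / n + damping * sum(inbox[v])
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pregel_pagerank(graph))
```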
Perhaps most notable, at least from my perspective, is the most recent addition, DuckDB. For a long time, vectorized out-of-core analytics databases were only available as proprietary products (like Vertica) or locked away in TU Munich’s code repositories (like HyPer, which was eventually sold to Tableau, then Salesforce). DuckDB not only provides a state-of-the-art implementation of a vectorized database, but also packages it up as an easy-to-use, embedded tool. As a result, DuckDB has become the de facto testbed for implementing new OLAP techniques, like new research on join ordering or confidential computing.
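The “easy-to-use, embedded” part is easy to demonstrate. Assuming the duckdb Python package is installed, something along these lines runs an analytical query in-process, with no server to set up (the table and query are made up for illustration):

```python
import duckdb

con = duckdb.connect()  # in-process, in-memory database; no server required

# Build a toy "title" table and run a vectorized aggregation over it.
con.execute(
    "CREATE TABLE title AS "
    "SELECT range AS id, range % 50 + 1990 AS production_year FROM range(1000000)"
)
rows = con.execute(
    "SELECT production_year, count(*) FROM title "
    "GROUP BY production_year ORDER BY 1 LIMIT 5"
).fetchall()
print(rows)
```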
ML-powered system tuning
Rank | Title | Year | Non-self Citations | Per Year
---|---|---|---|---
8 | Automatic Database Management System Tuning Through Large-scale Machine Learning. | 2017 | 109 | 13
17 | An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. | 2019 | 60 | 10
Databases have knobs. Lots and lots of knobs. Whether it’s the classic PostgreSQL `shared_buffers` or something more obscure like Vertica’s resource pools, getting the knobs set to the “right” values for your workload has always been an essential performance tuning step.
The two hottest papers in this category both represent machine learning powered techniques, either using reinforcement learning or Bayesian optimization. The approaches building on these two works have made a lot of progress in both cloud and on-premises environments. Of particular note is the DB-BERT line of work from Immanuel Trummer’s lab at Cornell, which integrates LLMs into the already-ML heavy area, resulting in a database tuner that can, almost literally, “read the manual.”
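Stripped of the machine learning, the tuning loop these systems automate looks roughly like the sketch below. The knob names, ranges, and the simulated run_workload function are placeholders of my own; real tuners replace the random search with Gaussian-process Bayesian optimization or deep reinforcement learning, and actually apply each configuration to a live database before measuring.

```python
import random

KNOB_RANGES = {
    "shared_buffers_mb": (128, 16384),   # illustrative knobs and ranges
    "work_mem_mb": (4, 1024),
    "effective_io_concurrency": (1, 256),
}

def run_workload(config):
    """Placeholder: a real tuner would apply the config (e.g. via ALTER SYSTEM),
    replay the workload, and return the observed latency."""
    penalty = abs(config["shared_buffers_mb"] - 4096) / 4096
    penalty += abs(config["work_mem_mb"] - 256) / 256
    return 100.0 * (1.0 + penalty) + random.gauss(0, 5)

def sample_config():
    return {k: random.randint(lo, hi) for k, (lo, hi) in KNOB_RANGES.items()}

# Random search as a stand-in for the Bayesian optimization / RL in the papers.
best_cfg, best_latency = None, float("inf")
for _ in range(50):
    cfg = sample_config()
    latency = run_workload(cfg)
    if latency < best_latency:
        best_cfg, best_latency = cfg, latency

print(best_cfg, round(best_latency, 1))
```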
Data cleaning and matching
Rank | Title | Year | Non-self Citations | Per Year
---|---|---|---|---
9 | Deep Entity Matching with Pre-Trained Language Models. | 2020 | 66 | 13
14 | HoloClean: Holistic Data Repairs with Probabilistic Inference. | 2017 | 90 | 11
15 | Deep Learning for Entity Matching: A Design Space Exploration. | 2018 | 71 | 10
Database systems are only as useful as the data they contain! Garbage in, garbage out. Tools for data cleaning and entity matching are critical for bringing structure to chaos, enabling analysis and hopefully dredging up some of that good good Actionable Business Insight we love to talk about.
The prevailing trend, as you might imagine, is integrating LLMs and other recent ML inventions into traditional data cleaning and matching tools. While the latency of these systems still leaves something to be desired, the accuracy of these new LLM-enabled tools is often far beyond what we could achieve before.
Some good examples of this trend include Avanika Narayan’s work in Chris Re’s lab, which shows that LLMs (“foundation models”) are quite effective at a number of data wrangling tasks. A slightly more involved approach is Nan Tang’s recent paper on the relational pre-trained transformer (which is apparently almost all you need), which is trained directly for data cleaning tasks and can even operate on complex, non-1NF types like JSON (but we always put our databases into 3NF, right?).
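To give a flavor of how these matchers consume relational data, here is a sketch of the record-pair serialization that Ditto-style matchers feed to a pre-trained language model. The bracketed COL/VAL tokens follow the general convention from that line of work, but the match stub below is a trivial token-overlap stand-in for the fine-tuned classifier, and the example records are made up.

```python
def serialize(record: dict) -> str:
    """Flatten a record into the COL/VAL text format Ditto-style matchers use."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

def serialize_pair(left: dict, right: dict) -> str:
    return f"{serialize(left)} [SEP] {serialize(right)}"

def match(pair_text: str) -> bool:
    """Stub: a real system fine-tunes a pre-trained language model as a binary
    classifier over this text. Here, a trivial token-overlap heuristic."""
    left, right = pair_text.split(" [SEP] ")
    l, r = set(left.lower().split()), set(right.lower().split())
    return len(l & r) / len(l | r) > 0.6

a = {"title": "iPhone 13 Pro 128GB", "brand": "Apple", "price": "999"}
b = {"title": "Apple iPhone 13 Pro (128 GB)", "brand": "apple", "price": "999.00"}
print(match(serialize_pair(a, b)))
```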
What’s next?
Another year goes by and databases get a little faster, a little cleaner, a little easier to use, and a little smarter. Maybe next year the DataFusion team or the DuckDB team will make them a little cheaper! But nobody tell the VCs.
Perhaps you want to hear the no-doubt enlightened opinion of me, your humble author? Well, I think smarter, faster, cheaper, and more accessible databases are great, but how come nobody is looking into making databases more polite? Look what happens if I misspell `WHERE`:
```
imdb=# select count(*) from title were production_year=2022;
ERROR:  syntax error at or near "production_year"
LINE 1: select count(*) from title were production_year=2022;
```
Frankly rude – no respect from PostgreSQL here. I guess if Mike Stonebraker raised me in a basement at Berkeley in the 80s, I might also be missing a few “pleases” and “thank yous” around the edges. Or maybe the middle is already hollow and the papers are all diarrhea, what do I know?
(If you actually want to know what I’ve been up to, check out our latest work on offline query optimization – SIGMOD papers coming soon!)