Ryan Marcus, assistant professor at the University of Pennsylvania. Using machine learning to build the next generation of data systems.
2024's hottest topics in databases (a bibliometric approach)

From database companies to renowned professor of databaseology Andy Pavlo, everyone seems to be writing their “year in review” of the database world. While I could never match Andy’s wit nor his no-doubt earnest adoration for Larry Ellison, I thought I might be able to provide some additional insights into the hottest topics in database research from a bibliometric perspective.

For the past few years, I’ve maintained a “ranking” of database researchers and papers using the citation graph and PageRank. We can use the same data to identify the papers with the fastest-growing citation counts, arguably a reasonable proxy for “hotness.” Note that these are the papers being cited a lot: they may represent the baseline for a particular area, or the first paper in a line of work, rather than the “hot” follow-on work itself.

Below, I’ve organized, by topic, the 20 papers with the fastest-growing non-self-citation counts.

Learned indexes

Rank  Title                                        Year  Non-self Citations  Per Year
   1  The Case for Learned Index Structures.       2018                 142        20
  12  ALEX: An Updatable Adaptive Learned Index.   2020                  59        11

Since Tim Kraska’s paper on learned indexes (written during a sabbatical at Google, just before his move from Brown to MIT), the database community has been hard at work developing new learned indexes, benchmarking them, and integrating them into real systems. A recent survey paper from Purdue shows just how explosive the subject has become:

[Figure: a genealogy tree of learned indexes]

The ALEX paper was (to the best of my knowledge) the first paper to make learned indexes updatable, leveraging an adaptive tree structure. The first author, Jialin Ding, was a PhD student in Kraska’s lab, and is now a researcher at Amazon. Jialin is also the lead author on a paper describing how multi-dimensional learned indexes were integrated into AWS Redshift.

It’s been a while since Kraska’s paper made its initial splash, but (for a SIGMOD paper) it was surprisingly controversial. Blog posts about whether we should throw out our algorithms textbooks or simply stick with B-trees were everywhere. In my opinion, part of this confusion arose because the original SIGMOD paper did not come with any code. And, if you don’t believe in an idea, it’s very easy to whip up a crappy implementation. Fortunately, since 2018, we’ve developed a much better understanding of why learned indexes work.
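The core idea is simple enough to sketch in a few lines: an index over sorted keys is just a function from keys to positions, so we can learn an approximation of that function and correct its error with a bounded search. Here’s a minimal read-only sketch of my own (a toy, not code from any of the papers above) using a single linear model:

import bisect

class LinearLearnedIndex:
    """Toy single-model learned index: a linear fit plus an error bound."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst-case prediction error bounds the correction search.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        # Model prediction, then a binary search within +/- err positions.
        pos = self._predict(key)
        lo, hi = max(0, pos - self.err), min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LinearLearnedIndex(range(0, 200_000, 2))
print(idx.lookup(4242))   # -> 2121
print(idx.lookup(4243))   # -> None (odd keys are absent)

Real learned indexes (the RMI, the PGM-index, ALEX) use hierarchies of models, tighter error bounds, and, in ALEX’s case, gapped arrays to support inserts; this toy only shows the predict-then-correct pattern.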

(Learned) query optimization

Rank  Title                                                                    Year  Non-self Citations  Per Year
   2  Learned Cardinalities: Estimating Correlated Joins with Deep Learning.   2019                 124        20
   3  Neo: A Learned Query Optimizer. [1]                                      2019                 116        19
   4  How Good Are Query Optimizers, Really?                                   2015                 188        18
   7  Bao: Making Learned Query Optimization Practical. [1]                    2021                  57        14
  19  An End-to-End Learning-based Cost Estimator.                             2019                  61        10
  20  Are We Ready For Learned Cardinality Estimation?                         2021                  42        10

Arguably since the 2001 IBM LEO paper, we’ve been trying to figure out how to build query optimizers that learn from their mistakes. After all, if the optimizer chooses a bad query plan, why shouldn’t it pick a better one next time?

The database community has been attacking this problem from (at least) three different directions, each of which is represented in the “hottest” papers:

  1. Using machine learning methods, like deep learning, to replace cardinality estimates,
  2. using semi-supervised or guided learning techniques to replace cost models,
  3. and using reinforcement learning to either replace or steer the query optimization process (sketched below).
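To make that third direction concrete, here’s a toy sketch of “steering” framed as a multi-armed bandit over optimizer hint sets. The hint sets are real PostgreSQL planner settings, but run_with_hints is a hypothetical stand-in for executing a query under a hint set, and the real Bao uses a tree convolution network over plan trees with Thompson sampling, not this epsilon-greedy loop:

import random

# Each "arm" is a set of PostgreSQL planner settings (a hint set).
HINT_SETS = [
    {},                             # let the optimizer do its thing
    {"enable_nestloop": "off"},     # forbid nested-loop joins
    {"enable_hashjoin": "off"},     # forbid hash joins
    {"enable_indexscan": "off"},    # forbid index scans
]

def run_with_hints(query, hints):
    # Hypothetical stand-in: a real system would apply `hints`, execute
    # `query`, and return the measured latency in seconds.
    return random.uniform(1.0, 10.0)

class EpsilonGreedySteerer:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.latencies = {arm: [] for arm in range(len(HINT_SETS))}

    def choose(self):
        untried = [arm for arm, obs in self.latencies.items() if not obs]
        if untried:
            return random.choice(untried)            # try every arm once
        if random.random() < self.epsilon:
            return random.randrange(len(HINT_SETS))  # explore
        # Exploit: the arm with the lowest average latency so far.
        return min(self.latencies,
                   key=lambda a: sum(self.latencies[a]) / len(self.latencies[a]))

    def run(self, query):
        arm = self.choose()
        latency = run_with_hints(query, HINT_SETS[arm])
        self.latencies[arm].append(latency)          # learn from the outcome
        return latency

steerer = EpsilonGreedySteerer()
for _ in range(100):
    steerer.run("SELECT count(*) FROM title t, cast_info ci WHERE t.id = ci.movie_id")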

While academics continue to wax poetic about the merits of each approach, practitioners in industry also made some inroads with learned query optimization, including a deployment of “steered” optimizers at Microsoft and Meta [1], as well as integrations of learned performance predictors into AWS Redshift’s intelligent scaling and workload manager [1].

The classic “How Good Are Query Optimizers, Really?” paper also makes an appearance, as the use of the join order benchmark (JOB) has become ubiquitous in query optimization research. Of course, this is also a seminal paper in query optimization!

DBMS engines

Rank  Title                                                                                        Year  Non-self Citations  Per Year
   5  Spark SQL: Relational Data Processing in Spark.                                              2015                 172        17
   6  Pregel: a system for large-scale graph processing.                                           2010                 214        14
  10  The Snowflake Elastic Data Warehouse.                                                        2016                 102        11
  11  Hekaton: SQL server's memory-optimized OLTP engine.                                          2013                 138        11
  13  Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases.  2017                  94        11
  16  DuckDB: an Embeddable Analytical Database.                                                   2019                  65        10
  18  Efficiently Compiling Efficient Query Plans for Modern Hardware.                             2011                 151        10

The execution model and engines of DBMSes will always be a core part of the database community’s interests. Recently, researchers have started to build on “new” foundations!

Perhaps the most notable, at least from my perspective, is the most recent addition: DuckDB. For a long time, vectorized out-of-core [2] analytics databases were only available as proprietary products (like Vertica) or locked away in TU Munich’s code repositories (like HyPer, which was sold to Tableau, which was in turn acquired by Salesforce). DuckDB not only provides a state-of-the-art implementation of a vectorized database, but also packages it up as an easy-to-use, embedded tool. As a result, DuckDB has become the de facto testbed for implementing new OLAP techniques, like new research on join ordering or confidential computing.
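If you haven’t tried it, the “embedded” part is the magic: there is no server to stand up. A quick taste using DuckDB’s Python package (the toy table and query are mine, not from the paper):

import duckdb  # pip install duckdb

con = duckdb.connect()  # an in-memory database; nothing to configure

# Build a tiny table; DuckDB can also query Parquet/CSV files in place.
con.execute("""
    CREATE TABLE title AS
    SELECT * FROM (VALUES (1, 2021), (2, 2022), (3, 2022)) t(id, production_year)
""")

print(con.execute(
    "SELECT production_year, count(*) FROM title GROUP BY production_year ORDER BY 1"
).fetchall())
# [(2021, 1), (2022, 2)]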

ML-powered system tuning

Rank  Title                                                                                    Year  Non-self Citations  Per Year
   8  Automatic Database Management System Tuning Through Large-scale Machine Learning.        2017                 109        13
  17  An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning.  2019                  60        10

Databases have knobs. Lots and lots of knobs. Whether it’s the classic PostgreSQL shared_buffers or something more obscure like Vertica’s resource pools, getting the knobs set to the “right” values for your workload has always been an essential performance tuning step.

The two hottest papers in this category both represent machine-learning-powered techniques, using either Bayesian optimization or deep reinforcement learning. The approaches building on these two works have made a lot of progress in both cloud and on-premises environments. Of particular note is the DB-BERT line of work from Immanuel Trummer’s lab at Cornell, which integrates LLMs into this already ML-heavy area, resulting in a database tuner that can, almost literally, “read the manual.”
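In that spirit, here’s a toy Bayesian-optimization-style tuning loop for a single knob. The benchmark function is a synthetic stand-in (a real tuner would measure actual workload latency under each configuration), and real systems tune dozens of knobs jointly:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def benchmark(shared_buffers_mb):
    # Synthetic stand-in: pretend latency bottoms out near 4 GB.
    return (shared_buffers_mb - 4096.0) ** 2 / 1e6 + np.random.normal(0, 0.5)

candidates = np.linspace(128, 16384, 200).reshape(-1, 1)  # knob values to try (MB)
X, y = [], []

# Seed the model with a few random configurations.
for x0 in np.random.choice(candidates.ravel(), 3, replace=False):
    X.append([x0]); y.append(benchmark(x0))

for _ in range(15):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    # Lower confidence bound: prefer low predicted latency and high uncertainty.
    nxt = float(candidates[np.argmin(mean - 1.5 * std)][0])
    X.append([nxt]); y.append(benchmark(nxt))

print(f"best shared_buffers found: {X[int(np.argmin(y))][0]:.0f} MB")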

Data cleaning and matching

Rank  Title                                                           Year  Non-self Citations  Per Year
   9  Deep Entity Matching with Pre-Trained Language Models.          2020                  66        13
  14  HoloClean: Holistic Data Repairs with Probabilistic Inference.  2017                  90        11
  15  Deep Learning for Entity Matching: A Design Space Exploration.  2018                  71        10

Database systems are only as useful as the data they contain! Garbage in, garbage out. Tools for data cleaning and entity matching are critical for bringing structure to chaos, enabling analysis and hopefully dredging up some of that good good Actionable Business Insight we love to talk about.

The prevailing trend, as you might imagine, is integrating LLMs and other recent ML inventions into traditional data cleaning and matching tools. While the latency of these systems still leaves something to be desired, the accuracy of these new LLM-enabled tools is often far beyond what we could do before.

Some good examples of this trend include Avanika Narayan’s work in Chris Ré’s lab, which shows that LLMs (“foundation models”) are quite effective at a number of data wrangling tasks. A slightly more involved approach is Nan Tang’s recent paper on the relational pre-trained transformer (which is apparently almost all you need), which is trained directly for data cleaning tasks and can even operate on complex, non-1NF types like JSON (but we always put our databases into 3NF, right?).
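For a taste of the LM-based approach, here’s a minimal sketch of entity matching via embedding similarity. The “COL/VAL” serialization is Ditto-style, but the rest is a simplification of my own: the papers above fine-tune the language model as a match/no-match classifier, which works much better than comparing off-the-shelf embeddings. Assumes the sentence-transformers package and its all-MiniLM-L6-v2 model:

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def serialize(record):
    # Ditto-style serialization: "COL <attr> VAL <value> ..."
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

a = {"title": "iPhone 14 Pro 128GB", "brand": "Apple", "price": "999"}
b = {"title": "Apple iPhone 14 Pro (128 GB)", "brand": "apple", "price": "$999.00"}
c = {"title": "Galaxy S23 Ultra 256GB", "brand": "Samsung", "price": "1199"}

emb = model.encode([serialize(r) for r in (a, b, c)], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high similarity: likely a match
print(util.cos_sim(emb[0], emb[2]).item())  # lower: likely not a match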

What’s next?

Another year goes by and databases get a little faster, a little cleaner, a little easier to use, and a little smarter. Maybe next year the DataFusion team or the DuckDB team will make them a little cheaper! But nobody tell the VCs.

Perhaps you want to hear the no-doubt enlightened opinion of me, your humble author? Well, I think smarter, faster, cheaper, and more accessible databases are great, but how come nobody is looking into making databases more polite? Look what happens if I misspell WHERE:

imdb=# select count(*) from title were production_year=2022;
ERROR:  syntax error at or near "production_year"
LINE 1: select count(*) from title were production_year=2022;

Frankly rude – no respect from PostgreSQL here. I guess if Mike Stonebraker raised me in a basement at Berkeley in the 80s, I might also be missing a few “pleases” and “thank yous” around the edges. Or maybe the middle is already hollow and the papers are all diarrhea, what do I know?

(If you actually want to know what I’ve been up to, check out our latest work on offline query optimization – SIGMOD papers coming soon!)

  1. I am an author of this paper.

  2. MonetDB is arguably an exception to this claim, since MonetDB could mmap larger-than-memory chunks, but I think it is reasonable to categorize MonetDB as a mostly in-memory analytics database.