Towards Scalable Dataframe Systems
Summary: Works towards scalable dataframe systems through MODIN, which scales pandas-like APIs using a simple dataframe data model and algebra. The signature features of dataframes are flexible schemas, ordering, row/column equivalence, and data/metadata fluidity; the trial-and-error interaction model of dataframe users raises open data-management research questions.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 12097
- Venue: VLDB
- Year: 2020
- Pagerank: 0.0001204248
- Overall Rank: 1,427 | 90.08%
- DOI: 10.14778/3407790.3407807
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 21 of 21 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 2,121 | Balsa: Learning a Query Optimizer Without Expert Demonstrations | 2022 | SIGMOD | 9.5017232e-05 |
| 2,954 | Magpie: Python at Speed and Scale using Cloud Backends | 2021 | CIDR | 7.8262582e-05 |
| 3,254 | Query Processing on Tensor Computation Runtimes | 2022 | VLDB | 7.3161051e-05 |
| 3,393 | Lux: Always-on Visualization Recommendations for Exploratory Dataframe Workflows | 2022 | VLDB | 7.1483239e-05 |
| 3,763 | Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System | 2022 | VLDB | 6.7801795e-05 |
| 4,239 | The Composable Data Management System Manifesto | 2023 | VLDB | 6.3318452e-05 |
| 4,773 | PolyFrame: A Retargetable Query-based Approach to Scaling Dataframes | 2021 | VLDB | 5.9320139e-05 |
| 5,307 | A Critique of Modern SQL And A Proposal Towards A Simple and Expressive Query Language | 2024 | CIDR | 5.5766594e-05 |
| 5,981 | DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python | 2021 | SIGMOD | 5.2448986e-05 |
| 6,541 | ConnectorX: Accelerating Data Loading From Databases to Dataframes | 2022 | VLDB | 5.0216945e-05 |
| 6,895 | Decentralized Actor Scheduling and Reference-based Storage in Xorbits: a Native Scalable Data Science Engine | 2025 | VLDB | 4.8925595e-05 |
| 8,163 | Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science | 2021 | VLDB | 4.5723431e-05 |
| 8,257 | Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines | 2023 | SIGMOD | 4.5487511e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 8,915 | DQDF: Data-Quality-Aware Dataframes | 2022 | VLDB | 4.427232e-05 |
| 9,912 | ElasticNotebook: Enabling Live Migration for Computational Notebooks | 2024 | VLDB | 4.2565279e-05 |
| 10,482 | Fast and Scalable Data Transfer Across Data Systems | 2025 | SIGMOD | 4.1945683e-05 |
| 10,591 | Accio: Bolt-on Query Federation | 2025 | VLDB | 4.1945683e-05 |
| 11,024 | SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis | 2024 | VLDB | 4.1945683e-05 |
| 11,396 | DPDS: Assisting Data Science with Data Provenance | 2022 | VLDB | 4.1945683e-05 |
| 11,429 | Leam: An Interactive System for In-situ Visual Text Analysis | 2021 | CIDR | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 24 of 24 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 14 | Online Aggregation | 1997 | SIGMOD | 0.0010801504 |
| 66 | Spark SQL: Relational Data Processing in Spark | 2015 | SIGMOD | 0.00061639801 |
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 179 | Efficient and Extensible Algorithms for Multi Query Optimization | 2000 | SIGMOD | 0.00037672155 |
| 185 | DuckDB: an Embeddable Analytical Database | 2019 | SIGMOD | 0.00036538405 |
| 515 | QPipe: A Simultaneously Pipelined Relational Query Engine | 2005 | SIGMOD | 0.00021214633 |
| 940 | SharedDB: Killing One Thousand Queries With One Stone | 2012 | VLDB | 0.00015173166 |
| 1,203 | PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS | 2004 | VLDB | 0.00013320373 |
| 1,204 | VerdictDB: Universalizing Approximate Query Processing | 2018 | SIGMOD | 0.00013319541 |
| 1,219 | Rate-Based Query Optimization for Streaming Information Sources | 2002 | SIGMOD | 0.00013223888 |
| 1,233 | Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources | 2003 | VLDB | 0.0001313363 |
| 1,383 | Querying XML Views of Relational Data | 2001 | VLDB | 0.00012270434 |
| 1,422 | SchemaSQL - A Language for Interoperability in Relational Multi-database Systems | 1996 | VLDB | 0.00012056887 |
| 1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361 |
| 1,900 | Hash joins and hash teams in Microsoft SQL Server | 1998 | VLDB | 0.000101645 |
| 2,011 | Rapid Sampling for Visualizations with Ordering Guarantees | 2015 | VLDB | 9.7964875e-05 |
| 2,097 | Predictive Interaction for Data Transformation | 2015 | CIDR | 9.5489822e-05 |
| 2,365 | The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing | 2014 | SIGMOD | 8.9551432e-05 |
| 2,580 | Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee | 2016 | SIGMOD | 8.5058814e-05 |
| 4,681 | Adaptive Sampling for Rapidly Matching Histograms | 2018 | VLDB | 6.0034918e-05 |
| 4,811 | OQL: A Query Language for Manipulating Object-oriented Databases | 1989 | VLDB | 5.9061974e-05 |
| 5,662 | Query Unnesting in Object-Oriented Databases | 1998 | SIGMOD | 5.3838456e-05 |
| 6,508 | DataSpread: Unifying Databases and Spreadsheets | 2015 | VLDB | 5.0335028e-05 |
| 6,822 | Skimmer: Rapid Scrolling of Relational Query Results | 2012 | SIGMOD | 4.9152454e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 11,024 | SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis | 2024 | VLDB | 4.1945683e-05 |
| 5,981 | DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python | 2021 | SIGMOD | 5.2448986e-05 |
| 4,003 | Data Platform for Machine Learning | 2019 | SIGMOD | 6.54347e-05 |
| 11,288 | To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks | 2023 | VLDB | 4.1945683e-05 |
| 2,954 | Magpie: Python at Speed and Scale using Cloud Backends | 2021 | CIDR | 7.8262582e-05 |
| 9,416 | When sweet and cute isn't enough anymore: Solving scalability issues in Python Pandas with Grizzly | 2020 | CIDR | 4.3441378e-05 |
| 4,813 | Putting Pandas in a Box | 2021 | CIDR | 5.9049746e-05 |
| 8,915 | DQDF: Data-Quality-Aware Dataframes | 2022 | VLDB | 4.427232e-05 |
| 4,773 | PolyFrame: A Retargetable Query-based Approach to Scaling Dataframes | 2021 | VLDB | 5.9320139e-05 |
| 3,763 | Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System | 2022 | VLDB | 6.7801795e-05 |