DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python
Summary: DataPrep.EDA is a task-centric, declarative EDA system in Python that lets researchers specify diverse EDA tasks with a single function call. Its Dask-backed pipelines scale the workflow, and it is reported to deliver faster, more usable EDA than Pandas-profiling. It is open-sourced as part of DataPrep. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Jinglin Peng
2. Weiyuan Wu
3. Brandon Lockhart
4. Song Bian
5. Jing Nathan Yan
6. Linghao Xu
7. Zhixuan Chi
8. Jeffrey M. Rzeszotarski
9. Jiannan Wang
Incoming Citations (Sorted by Pagerank)
Showing 6 of 6 citing papers.
| Overall Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 5,509 | Can Large Language Models Predict Data Correlations from Column Names? | 2023 | VLDB | 5.4703368e-05 |
| 6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05 |
| 10,512 | Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables | 2025 | SIGMOD | 4.1945683e-05 |
| 10,610 | Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation | 2025 | VLDB | 4.1945683e-05 |
| 10,682 | AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | 2025 | VLDB | 4.1945683e-05 |
| 10,784 | Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 8 of 8 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Overall Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 460 | SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics | 2015 | VLDB | 0.00022516069 |
| 1,427 | Towards Scalable Dataframe Systems | 2020 | VLDB | 0.0001204248 |
| 1,625 | Data Profiling with Metanome | 2015 | VLDB | 0.00011094926 |
| 2,993 | Foresight: Recommending Visual Insights | 2017 | VLDB | 7.7687088e-05 |
| 3,546 | Extracting Top-K Insights from Multi-dimensional Data | 2017 | SIGMOD | 6.9870745e-05 |
| 5,217 | QuickInsights: Quick and Automatic Discovery of Insights from Multi-Dimensional Data | 2019 | SIGMOD | 5.6227959e-05 |
| 7,364 | ExplainED: Explanations for EDA Notebooks | 2020 | VLDB | 4.7519211e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,915 | DQDF: Data-Quality-Aware Dataframes | 2022 | VLDB | 4.427232e-05 |
| 11,123 | PD-Explain: A Unified Python-native Framework for Query Explanations Over DataFrames | 2024 | VLDB | 4.1945683e-05 |
| 9,830 | Towards Autonomous, Hands-Free Data Exploration | 2020 | CIDR | 4.2751057e-05 |
| 11,288 | To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks | 2023 | VLDB | 4.1945683e-05 |
| 7,306 | DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines | 2022 | CIDR | 4.7678574e-05 |
| 4,540 | Automating Exploratory Data Analysis via Machine Learning: An Overview | 2020 | SIGMOD | 6.1033443e-05 |
| 11,515 | From Papers to Practice: The openclean Open-Source Data Cleaning Library | 2021 | VLDB | 4.1945683e-05 |
| 7,364 | ExplainED: Explanations for EDA Notebooks | 2020 | VLDB | 4.7519211e-05 |
| 3,878 | Data Canopy: Accelerating Exploratory Statistical Analysis | 2017 | SIGMOD | 6.6731435e-05 |
| 9,911 | Dias: Dynamic Rewriting of Pandas Code | 2024 | SIGMOD | 4.2565279e-05 |