Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
Summary: Auto-Suggest learns to recommend data preparation steps by mining the data manipulations found in data science notebooks. It crawls 4M Jupyter notebooks from GitHub, replays their steps to log each operation's inputs, outputs, and the decisions made, and uses these logs to learn data-driven preparation recommendations that beat baselines.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 5953
- Venue: SIGMOD
- Year: 2020
- Pagerank: 7.3178277e-05
- Overall Rank: 3,252 | 77.38%
- DOI: 10.1145/3318464.3389738
Incoming Citations (Sorted by Pagerank)
Showing 17 of 17 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 3,015 | Chorus: Foundation Models for Unified Data Discovery and Exploration | 2024 | VLDB | 7.7092391e-05 |
| 5,275 | Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples | 2023 | VLDB | 5.5905507e-05 |
| 5,280 | Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V | 2023 | VLDB | 5.5896735e-05 |
| 5,383 | Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search | 2021 | VLDB | 5.5393038e-05 |
| 6,409 | Fine-Grained Lineage for Safer Notebook Interactions | 2021 | VLDB | 5.0756653e-05 |
| 6,895 | Decentralized Actor Scheduling and Reference-based Storage in Xorbits: a Native Scalable Data Science Engine | 2025 | VLDB | 4.8925595e-05 |
| 8,388 | FEDEX: An Explainability Framework for Data Exploration Steps | 2022 | VLDB | 4.5297787e-05 |
| 8,645 | Predicate Pushdown for Data Science Pipelines | 2023 | SIGMOD | 4.4772518e-05 |
| 8,828 | HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation | 2023 | SIGMOD | 4.4407488e-05 |
| 9,371 | Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations | 2024 | SIGMOD | 4.3480692e-05 |
| 9,490 | Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph | 2023 | VLDB | 4.3341665e-05 |
| 10,152 | Data-Semantics-Aware Recommendation of Diverse Pivot Tables | 2026 | SIGMOD | 4.1945683e-05 |
| 10,168 | FlowPilot: A Suggestion System for Designing Scientific Workflows | 2026 | SIGMOD | 4.1945683e-05 |
| 11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05 |
| 11,103 | LucidScript: Bottom-up Standardization for Data Preparation | 2024 | VLDB | 4.1945683e-05 |
| 11,297 | DataRinse: Semantic Transforms for Data preparation based on Code Mining | 2023 | VLDB | 4.1945683e-05 |
| 11,429 | Leam: An Interactive System for In-situ Visual Text Analysis | 2021 | CIDR | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 25 of 25 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 98 | XMark: A Benchmark for XML Data Management | 2002 | VLDB | 0.00050023808 |
| 475 | Mining Database Structure; Or, How to Build a Data Quality Browser | 2002 | SIGMOD | 0.00022303253 |
| 600 | Linear Road: A Stream Data Management Benchmark | 2004 | VLDB | 0.0001938744 |
| 1,009 | SnipSuggest: Context-Aware Autocompletion for SQL | 2011 | VLDB | 0.00014653644 |
| 1,267 | Foofah: Transforming Data By Example | 2017 | SIGMOD | 0.00012936483 |
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |
| 1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853 |
| 1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164 |
| 1,469 | BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations | 2016 | VLDB | 0.00011836053 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794 |
| 1,664 | On Multi-Column Foreign Key Discovery | 2010 | VLDB | 0.00010976887 |
| 2,097 | Predictive Interaction for Data Transformation | 2015 | CIDR | 9.5489822e-05 |
| 2,158 | Uni-Detect: A Unified Approach to Automated Error Detection in Tables | 2019 | SIGMOD | 9.4141354e-05 |
| 2,506 | Auto-Detect: Data-Driven Error Detection in Tables | 2018 | SIGMOD | 8.6335464e-05 |
| 3,299 | SCODED: Statistical Constraint Oriented Data Error Detection | 2020 | SIGMOD | 7.2546659e-05 |
| 3,478 | Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations | 2018 | VLDB | 7.054159e-05 |
| 3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05 |
| 3,735 | Auto-Join: Joining Tables by Leveraging Transformations | 2017 | VLDB | 6.8061318e-05 |
| 3,742 | TEGRA: Table Extraction by Global Record Alignment | 2015 | SIGMOD | 6.7966898e-05 |
| 4,850 | SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora | 2015 | VLDB | 5.8768452e-05 |
| 5,486 | Fast Foreign-Key Detection in Microsoft SQL Server PowerPivot for Excel | 2014 | VLDB | 5.4811603e-05 |
| 6,195 | WADaR: Joint Wrapper and Data Repair | 2015 | VLDB | 5.1618114e-05 |
| 6,697 | The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS | 2005 | VLDB | 4.9577992e-05 |
| 8,499 | Synthesizing Mapping Relationships Using Table Corpus | 2017 | SIGMOD | 4.4975851e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 4,173 | Automatic Example Queries for Ad Hoc Databases | 2011 | SIGMOD | 6.3874627e-05 |
| 11,549 | Active Reinforcement Learning for Data Preparation: Learn2Clean with Human-In-The-Loop | 2020 | CIDR | 4.1945683e-05 |
| 10,682 | AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | 2025 | VLDB | 4.1945683e-05 |
| 10,152 | Data-Semantics-Aware Recommendation of Diverse Pivot Tables | 2026 | SIGMOD | 4.1945683e-05 |
| 9,490 | Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph | 2023 | VLDB | 4.3341665e-05 |
| 5,096 | Auto-Transform: Learning-to-Transform by Patterns | 2020 | VLDB | 5.7011825e-05 |
| 1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787 |
| 13,291 | Towards Understanding Data Analysis Workflows using a Large Notebook Corpus | 2019 | SIGMOD | - |
| 5,383 | Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search | 2021 | VLDB | 5.5393038e-05 |
| 10,598 | Auto-Prep: Holistic Prediction of Data Preparation Steps for Self-Service Business Intelligence | 2025 | VLDB | 4.1945683e-05 |