CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning
Summary: CtxPipe automates context-aware construction of data preparation pipelines for ML, using pretrained embeddings to capture data semantics and guide component selection. A deep reinforcement learning (RL) framework searches the space of candidate pipelines, delivering higher feature quality and faster models.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 7000
- Venue: SIGMOD
- Year: 2024
- Pagerank: 4.456315e-05
- Overall Rank: 8,743 | 39.18%
- DOI: 10.1145/3698831
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 1 of 1 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 24 of 24 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 921 | Democratizing Data Science through Interactive Curation of ML Pipelines | 2019 | SIGMOD | 0.00015337438 |
| 1,047 | Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms | 2015 | VLDB | 0.00014459715 |
| 1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794 |
| 1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,253 | Efficient Denial Constraint Discovery with Hydra | 2018 | VLDB | 9.1937209e-05 |
| 2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05 |
| 2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 3,105 | Data X-Ray: A Diagnostic Tool for Data Errors | 2015 | SIGMOD | 7.5568954e-05 |
| 3,440 | Approximate Denial Constraints | 2020 | VLDB | 7.0918817e-05 |
| 3,467 | Data Profiling – A Tutorial | 2017 | SIGMOD | 7.069081e-05 |
| 4,682 | Scalable Discovery of Unique Column Combinations | 2014 | VLDB | 6.0022412e-05 |
| 5,192 | Pattern Functional Dependencies for Data Cleaning | 2020 | VLDB | 5.6375087e-05 |
| 5,429 | DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data | 2023 | SIGMOD | 5.5087325e-05 |
| 6,437 | Fundamentals of Order Dependencies | 2012 | VLDB | 5.0631488e-05 |
| 6,944 | DataPrism: Exposing Disconnect between Data and Systems | 2022 | SIGMOD | 4.8912787e-05 |
| 7,202 | Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems | 2021 | SIGMOD | 4.8023314e-05 |
| 7,719 | WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model | 2022 | VLDB | 4.6686188e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,341 | BugDoc: Algorithms to Debug Computational Processes | 2020 | SIGMOD | 4.5433282e-05 |
| 8,828 | HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation | 2023 | SIGMOD | 4.4407488e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05 |
| 8,257 | Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines | 2023 | SIGMOD | 4.5487511e-05 |
| 6,291 | Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines | 2021 | CIDR | 5.1269764e-05 |
| 5,429 | DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data | 2023 | SIGMOD | 5.5087325e-05 |
| 5,304 | A Scalable AutoML Approach Based on Graph Neural Networks | 2022 | VLDB | 5.5779335e-05 |
| 11,549 | Active Reinforcement Learning for Data Preparation: Learn2Clean with Human-In-The-Loop | 2020 | CIDR | 4.1945683e-05 |
| 13,098 | Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines | 2025 | SIGMOD | - |
| 10,628 | CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines | 2025 | VLDB | 4.1945683e-05 |
| 5,383 | Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search | 2021 | VLDB | 5.5393038e-05 |
| 8,828 | HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation | 2023 | SIGMOD | 4.4407488e-05 |