CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines
Summary: CatDB leverages data-catalog metadata to produce dataset-specific instructions that steer LLMs to synthesize end-to-end data-centric ML pipelines (cleaning, augmentation, feature engineering, tuning). Built-in validation and error handling reduce hallucinations and yield AutoML-comparable accuracy with much faster scaling.
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13907
- Venue: VLDB
- Year: 2025
- Pagerank: 4.1945683e-05
- Overall Rank: 10,628 | 26.07%
- DOI: 10.14778/3742728.3742754
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
Outgoing Citations (Sorted by Pagerank)
Showing 21 of 21 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 168 | MAD Skills: New Analysis Practices for Big Data | 2009 | VLDB | 0.00038946305 |
| 369 | Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation | 2024 | VLDB | 0.0002547515 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 921 | Democratizing Data Science through Interactive Curation of ML Pipelines | 2019 | SIGMOD | 0.00015337438 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05 |
| 3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05 |
| 3,467 | Data Profiling – A Tutorial | 2017 | SIGMOD | 7.069081e-05 |
| 3,824 | Correlation Sketches for Approximate Join-Correlation Queries | 2021 | SIGMOD | 6.7260705e-05 |
| 4,769 | Automated Feature Engineering for Algorithmic Fairness | 2021 | VLDB | 5.934329e-05 |
| 4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 5,304 | A Scalable AutoML Approach Based on Graph Neural Networks | 2022 | VLDB | 5.5779335e-05 |
| 6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,208 | SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions | 2024 | CIDR | 4.5581306e-05 |
| 9,348 | GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models | 2024 | SIGMOD | 4.3526427e-05 |
| 9,610 | Searching a Database of Source Codes Using Contextualized Code Search | 2020 | VLDB | 4.3177432e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 9,476 | Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents | 2025 | SIGMOD | 4.3341665e-05 |
| 8,736 | Unveiling Challenges for LLMs in Enterprise Data Engineering | 2026 | VLDB | 4.456315e-05 |
| 9,243 | Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models | 2024 | VLDB | 4.3690661e-05 |
| 10,142 | AutoDDG: Automated Dataset Description Generation using Large Language Models | 2026 | SIGMOD | 4.1945683e-05 |
| 8,974 | DataLoom: Simplifying Data Loading with LLMs | 2024 | VLDB | 4.4184286e-05 |
| 6,389 | Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs | 2024 | VLDB | 5.0844009e-05 |
| 8,743 | CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning | 2024 | SIGMOD | 4.456315e-05 |
| 7,020 | LLM for Data Management | 2024 | VLDB | 4.8595728e-05 |
| 10,316 | LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning | 2026 | VLDB | 4.1945683e-05 |
| 13,098 | Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines | 2025 | SIGMOD | - |