Database Paper Browser


CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines

Summary: CatDB leverages data-catalog metadata to produce dataset-specific instructions that steer LLMs to synthesize end-to-end data-centric ML pipelines (cleaning, augmentation, feature engineering, tuning). Built-in validation and error handling reduce hallucinations and yield AutoML-comparable accuracy with much faster scaling. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13907
Venue: VLDB
Year: 2025
Pagerank: 4.1945683e-05
Overall Rank: 10,628 | 26.07%
DOI: 10.14778/3742728.3742754

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Authors

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank

Outgoing Citations (Sorted by Pagerank)

Showing 21 of 21 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
168 | MAD Skills: New Analysis Practices for Big Data | 2009 | VLDB | 0.00038946305
369 | Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation | 2024 | VLDB | 0.0002547515
517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035
610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674
791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664
921 | Democratizing Data Science through Interactive Curation of ML Pipelines | 2019 | SIGMOD | 0.00015337438
1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533
2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05
2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05
3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05
3,467 | Data Profiling – A Tutorial | 2017 | SIGMOD | 7.069081e-05
3,824 | Correlation Sketches for Approximate Join-Correlation Queries | 2021 | SIGMOD | 6.7260705e-05
4,769 | Automated Feature Engineering for Algorithmic Fairness | 2021 | VLDB | 5.934329e-05
4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05
5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05
5,304 | A Scalable AutoML Approach Based on Graph Neural Networks | 2022 | VLDB | 5.5779335e-05
6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05
8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05
8,208 | SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions | 2024 | CIDR | 4.5581306e-05
9,348 | GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models | 2024 | SIGMOD | 4.3526427e-05
9,610 | Searching a Database of Source Codes Using Contextualized Code Search | 2020 | VLDB | 4.3177432e-05

Semantically Similar Papers