LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Summary: An LLM-agent framework for privacy-preserving, automatic data processing (DP) in model fine-tuning. It iteratively synthesizes and refines data processing pipelines from prompts and feedback, without inspecting the raw data. Key accelerators are distribution-preserving sampling, low-quality target selection, and cache-and-reuse, yielding a 10x faster search. (summarized by gpt-5.4-mini on Apr 12 2026)

Paper ID: 14373
Venue: VLDB
Year: 2026
Pagerank: 4.1905499e-05
Overall Rank: 10,328 | 28.22%
DOI: 10.14778/3796195.3796196

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Authors

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 3 of 3 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
5,439	DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data	2023	SIGMOD	5.5034427e-05
5,924	Data-Juicer: A One-Stop Data Processing System for Large Language Models	2024	SIGMOD	5.2674873e-05
5,965	Automatic Data Acquisition for Deep Learning	2021	VLDB	5.2476363e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
3,803	Revisiting Prompt Engineering via Declarative Crowdsourcing	2024	CIDR	6.7498941e-05
11,061	LLM-PBE: Assessing Data Privacy in Large Language Models	2024	VLDB	4.1905499e-05
5,924	Data-Juicer: A One-Stop Data Processing System for Large Language Models	2024	SIGMOD	5.2674873e-05
1,839	DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing	2025	VLDB	0.00010351287
10,636	CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines	2025	VLDB	4.1905499e-05
10,690	AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework	2025	VLDB	4.1905499e-05
10,064	Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees	2026	SIGMOD	4.1905499e-05
13,112	Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines	2025	SIGMOD	-
9,222	Intelligent Agents for Data Exploration	2024	VLDB	4.366098e-05
7,016	LLM for Data Management	2024	VLDB	4.8561622e-05