Database Paper Browser

Data-Juicer: A One-Stop Data Processing System for Large Language Models

Summary: Data-Juicer is a one-stop data-processing system for constructing diverse data recipes to train and evaluate LLMs. Its distinguishing data-management features are fine-grained pipelines with 50+ operators, support for heterogeneous sources, visual auto-evaluation, and distributed-LLM integration. Reported results are gains of up to 7.45% on average across 16 benchmarks and a 17.5% higher GPT-4 win rate. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 6793
Venue: SIGMOD
Year: 2024
Pagerank: 5.2725159e-05
Overall Rank: 5,921 | 58.81%
DOI: 10.1145/3626246.3653385

Incoming Citations (Sorted by Pagerank)

Showing 3 of 3 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
8,488 | Can Large Language Models Be Query Optimizer for Relational Databases? | 2026 | SIGMOD | 4.4998609e-05
10,183 | Mixtera: A Data Plane for Foundation Model Training | 2026 | SIGMOD | 4.1945683e-05
10,316 | LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning | 2026 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 0 of 0 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers