Data-Juicer: A One-Stop Data Processing System for Large Language Models
Summary: Data-Juicer is a one-stop data-processing system for constructing diverse data recipes to train and evaluate LLMs. Its data-management features include fine-grained pipelines with 50+ operators, support for heterogeneous sources, visual auto-evaluation, and distributed-LLM integration. It achieves up to 7.45% average gains across 16 benchmarks and a 17.5% higher GPT-4 win rate. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Daoyuan Chen
2. Yilun Huang
3. Zhijian Ma
4. Hesen Chen
5. Xuchen Pan
6. Ce Ge
7. Dawei Gao
8. Yuexiang Xie
9. Zhaoyang Liu
10. Jinyang Gao
11. Yaliang Li
12. Bolin Ding
13. Jingren Zhou
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,488 | Can Large Language Models Be Query Optimizer for Relational Databases? | 2026 | SIGMOD | 4.4998609e-05 |
| 10,183 | Mixtera: A Data Plane for Foundation Model Training | 2026 | SIGMOD | 4.1945683e-05 |
| 10,316 | LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning | 2026 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 0 of 0 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.