Pollock: A Data Loading Benchmark
Summary: Introduces Pollock, a benchmark and formal pollution model for generating realistic non-standard CSV dialects and structural corruptions, based on a survey of real-world files. The framework is used to evaluate the robustness of 16 parsing, database, spreadsheet, and visualization systems. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Gerardo Vitagliano
2. Mazhar Hameed
3. Lan Jiang
4. Lucas Reisener
5. Eugene Wu
6. Felix Naumann
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,587 | Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks | 2024 | SIGMOD | 8.4924618e-05 |
| 5,928 | SchemaPile: A Large Collection of Relational Database Schemas | 2024 | SIGMOD | 5.2685946e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 7 of 7 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,437 | Speculative Distributed CSV Data Parsing for Big Data Analytics | 2019 | SIGMOD | 7.0942161e-05 |
| 3,963 | Pytheas: Pattern-based Table Discovery in CSV Files | 2020 | VLDB | 6.5840643e-05 |
| 5,114 | TPC-DI: The First Industry Benchmark for Data Integration | 2014 | VLDB | 5.6863051e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 6,846 | A framework for annotating CSV-like data | 2016 | VLDB | 4.9092462e-05 |
| 8,121 | Automation of Data Prep, ML, and Data Science: New Cure or Snake Oil? | 2021 | SIGMOD | 4.5809305e-05 |
| 11,420 | Detecting Layout Templates in Complex Multiregion Files | 2022 | VLDB | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05 |
| 10,221 | NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions | 2026 | VLDB | 4.1945683e-05 |
| 369 | Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation | 2024 | VLDB | 0.0002547515 |
| 2,433 | ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems | 2024 | VLDB | 8.8285962e-05 |
| 2,322 | Instant Loading for Main Memory Databases | 2013 | VLDB | 9.034874e-05 |
| 8,007 | A Grammar-based Entity Representation Framework for Data Cleaning | 2009 | SIGMOD | 4.6068018e-05 |
| 1,343 | NoDB: Efficient Query Execution on Raw Data Files | 2012 | SIGMOD | 0.00012482538 |
| 5,353 | An In-Depth Benchmarking of Text-to-SQL Systems | 2021 | SIGMOD | 5.5521332e-05 |
| 3,437 | Speculative Distributed CSV Data Parsing for Big Data Analytics | 2019 | SIGMOD | 7.0942161e-05 |
| 6,846 | A framework for annotating CSV-like data | 2016 | VLDB | 4.9092462e-05 |