Speculative Distributed CSV Data Parsing for Big Data Analytics
Summary: Speculative distributed CSV parsing lets each chunk of a file be parsed in parallel by determining field and record boundaries within the chunk without context from the preceding chunks. The approach robustly detects syntax errors, and Spark experiments on 11k real-world datasets show substantial performance gains over prior parsers. (summarized by gpt-5-nano on Feb 09 2026)
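To make the boundary problem concrete, here is a minimal Python sketch. It is not the paper's algorithm. It only illustrates why a chunk cut at an arbitrary offset cannot locate its first record without knowing whether it starts inside a quoted field, and how a guess can be checked afterwards. It assumes RFC 4180-style quoting, where `"` delimits quoted fields and `""` escapes a quote, so the parity of quote characters determines the state. The function names `starts_in_quotes` and `first_record_start` are hypothetical.

```python
from typing import List, Optional


def starts_in_quotes(chunks: List[str]) -> List[bool]:
    """For each chunk, report whether it begins inside a quoted field.

    Assumption: RFC 4180-style quoting, so an odd number of '"' characters
    before a position means that position is inside quotes.
    """
    states = []
    inside = False
    for chunk in chunks:
        states.append(inside)
        if chunk.count('"') % 2 == 1:
            inside = not inside
    return states


def first_record_start(chunk: str, inside: bool) -> Optional[int]:
    """Index just past the first unquoted newline, given a starting quote state."""
    for i, ch in enumerate(chunk):
        if ch == '"':
            inside = not inside
        elif ch == "\n" and not inside:
            return i + 1
    return None  # no record boundary in this chunk


# A record with a newline inside a quoted field, cut in the middle of that field.
text = 'id,note\n1,"hello\nworld"\n2,plain\n'
chunks = [text[:16], text[16:]]

# Speculation: every chunk guesses that it starts outside quotes.
guesses = [first_record_start(c, inside=False) for c in chunks]

# Validation: the true states need only a cheap quote count per chunk.
truth = [first_record_start(c, inside=s)
         for c, s in zip(chunks, starts_in_quotes(chunks))]

# Chunks whose guess differs from the truth would have to be re-parsed.
misspeculated = [i for i, (g, t) in enumerate(zip(guesses, truth)) if g != t]
print(guesses, truth, misspeculated)  # [8, 1] [8, 8] [1]
```

In this example the second chunk guesses that its first record starts at index 1, while the true start is index 8, so only that chunk needs to be redone. The sketch validates by counting quotes over all chunks, which is one simple option chosen here for illustration.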
[Chart: Incoming Non-self Citations Over Time]
Authors
1. Chang Ge
2. Yinan Li
3. Eric Eilebrecht
4. Badrish Chandramouli
5. Donald Kossmann
Incoming Citations (Sorted by PageRank)
Showing 13 of 13 citing papers.
Outgoing Citations (Sorted by PageRank)
Showing 10 of 10 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Overall Rank | Cited Paper | Year | Venue | PageRank |
|---|---|---|---|---|
| 66 | Spark SQL: Relational Data Processing in Spark | 2015 | SIGMOD | 6.1639801e-04 |
| 109 | Dremel: Interactive Analysis of Web-Scale Datasets | 2010 | VLDB | 4.8186983e-04 |
| 1,343 | NoDB: Efficient Query Execution on Raw Data Files | 2012 | SIGMOD | 1.2482538e-04 |
| 2,322 | Instant Loading for Main Memory Databases | 2013 | VLDB | 9.034874e-05 |
| 2,367 | Here are my Data Files. Here are my Queries. Where are my Results? | 2011 | CIDR | 8.9511058e-05 |
| 2,700 | Filter Before You Parse: Faster Analytics on Raw Data with Sparser | 2018 | VLDB | 8.2728509e-05 |
| 2,757 | Parallel Data Analysis Directly on Scientific File Formats | 2014 | SIGMOD | 8.1679384e-05 |
| 2,819 | Mison: A Fast JSON Parser for Data Analytics | 2017 | VLDB | 8.0651326e-05 |
| 2,973 | Parallel In-Situ Data Processing with Speculative Loading | 2014 | SIGMOD | 7.7902322e-05 |
| 3,548 | Adaptive Query Processing on RAW Data | 2014 | VLDB | 6.9859242e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | PageRank |
|---|---|---|---|---|
| 7,794 | Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark | 2016 | VLDB | 4.6482977e-05 |
| 9,504 | Supporting Scalable Analytics with Latency Constraints | 2015 | VLDB | 4.3341665e-05 |
| 2,973 | Parallel In-Situ Data Processing with Speculative Loading | 2014 | SIGMOD | 7.7902322e-05 |
| 5,915 | Runtime-Extensible Parsers | 2025 | CIDR | 5.274713e-05 |
| 11,427 | Accelerating Complex Analytics using Speculation | 2021 | CIDR | 4.1945683e-05 |
| 3,200 | Big Data Analytics with Datalog Queries on Spark | 2016 | SIGMOD | 7.3912411e-05 |
| 6,846 | A framework for annotating CSV-like data | 2016 | VLDB | 4.9092462e-05 |
| 9,124 | Dynamic Speculative Optimizations for SQL Compilation in Apache Spark | 2020 | VLDB | 4.391961e-05 |
| 7,360 | ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data | 2020 | VLDB | 4.7525925e-05 |
| 2,700 | Filter Before You Parse: Faster Analytics on Raw Data with Sparser | 2018 | VLDB | 8.2728509e-05 |