Filter Before You Parse: Faster Analytics on Raw Data with Sparser
Summary: Raw filtering applies query predicates to the raw bytestream before parsing, so records that cannot match are discarded without being parsed. Sparser implements raw filters (RFs) with SIMD instructions, combines them into cascades, and uses a lightweight optimizer to pick the best cascade for the data and format at hand (JSON, Avro, or Parquet). It speeds up parsers by up to 22x and end-to-end applications by up to 9x.
(summarized by gpt-5-nano on Feb 09 2026)
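The filter-before-parse idea in the summary can be illustrated with a minimal Python sketch. This is not Sparser's implementation: it uses a plain byte-substring test instead of SIMD raw filters, has no cascade optimizer, and the function name `raw_filter_then_parse` and the sample records are hypothetical. It also assumes the searched value appears verbatim (unescaped) in the raw bytes. What it shows is that a raw filter may let false positives through, so every surviving record is still parsed and re-checked.

```python
import json


def raw_filter_then_parse(lines, needle, field, value):
    """Return (matching records, number of records actually parsed).

    `needle` is a byte string that must occur in any record that can match,
    so records without it are skipped before the (expensive) parse step.
    Hypothetical helper for illustration only.
    """
    parsed = 0
    results = []
    for raw in lines:
        # Raw filter: cheap substring search over the unparsed bytes.
        if needle not in raw:
            continue
        # Only surviving records pay the parsing cost.
        parsed += 1
        record = json.loads(raw)
        # The raw filter can produce false positives, so verify the
        # real predicate on the parsed record.
        if record.get(field) == value:
            results.append(record)
    return results, parsed


# Made-up newline-delimited JSON records.
lines = [
    b'{"user": "alice", "lang": "en"}',
    b'{"user": "bob", "lang": "fr"}',
    b'{"user": "fr", "lang": "en"}',  # passes the raw filter, fails the predicate
    b'{"user": "carol", "lang": "fr"}',
]

matches, parsed = raw_filter_then_parse(lines, b'"fr"', "lang", "fr")
print(parsed, [m["user"] for m in matches])
```

Here one of the four records is rejected without parsing, and the false positive is caught by the check after parsing. The benefit grows as the predicate becomes more selective.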
- Paper ID: 11643
- Venue: VLDB
- Year: 2018
- Pagerank: 8.2728509e-05
- Overall Rank: 2,700 | 81.22%
- DOI: 10.14778/3236187.3236207
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 17 of 17 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 3,259 | AS-Parser: Log Parsing Based on Adaptive Segmentation | 2023 | SIGMOD | 7.3147783e-05 |
| 3,437 | Speculative Distributed CSV Data Parsing for Big Data Analytics | 2019 | SIGMOD | 7.0942161e-05 |
| 4,602 | Accelerating Raw Data Analysis with the ACCORDA Software and Hardware Architecture | 2019 | VLDB | 6.0567387e-05 |
| 4,704 | JSON Tiles: Fast Analytics on Semi-Structured Data | 2021 | SIGMOD | 5.9853687e-05 |
| 6,282 | Cheetah: Accelerating Database Queries with Switch Pruning | 2020 | SIGMOD | 5.128797e-05 |
| 7,360 | ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data | 2020 | VLDB | 4.7525925e-05 |
| 7,427 | Selection Pushdown in Column Stores using Bit Manipulation Instructions | 2023 | SIGMOD | 4.7327406e-05 |
| 7,497 | Stackless Processing of Streamed Trees | 2021 | PODS | 4.7180617e-05 |
| 7,830 | Scalable Structural Index Construction for JSON Analytics | 2021 | VLDB | 4.6388763e-05 |
| 8,788 | FishStore: Faster Ingestion with Subset Hashing | 2019 | SIGMOD | 4.451039e-05 |
| 9,124 | Dynamic Speculative Optimizations for SQL Compilation in Apache Spark | 2020 | VLDB | 4.391961e-05 |
| 9,379 | GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example | 2023 | SIGMOD | 4.3462787e-05 |
| 9,837 | GpJSON: High-performance JSON Data Processing on GPUs | 2025 | VLDB | 4.2740344e-05 |
| 10,482 | Fast and Scalable Data Transfer Across Data Systems | 2025 | SIGMOD | 4.1945683e-05 |
| 11,150 | Zed: Leveraging Data Types to Process Eclectic Data | 2023 | CIDR | 4.1945683e-05 |
| 11,189 | dsJSON: A Distributed SQL JSON Processor | 2023 | SIGMOD | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 15 of 15 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 66 | Spark SQL: Relational Data Processing in Spark | 2015 | SIGMOD | 0.00061639801 |
| 1,043 | Adaptive Ordering of Pipelined Stream Filters | 2004 | SIGMOD | 0.00014476247 |
| 1,343 | NoDB: Efficient Query Execution on Raw Data Files | 2012 | SIGMOD | 0.00012482538 |
| 1,807 | H2O: A Hands-free Adaptive Store | 2014 | SIGMOD | 0.00010487796 |
| 2,001 | Sinew: A SQL System for Multi-Structured Data | 2014 | SIGMOD | 9.8186417e-05 |
| 2,322 | Instant Loading for Main Memory Databases | 2013 | VLDB | 9.034874e-05 |
| 2,367 | Here are my Data Files. Here are my Queries. Where are my Results? | 2011 | CIDR | 8.9511058e-05 |
| 2,819 | Mison: A Fast JSON Parser for Data Analytics | 2017 | VLDB | 8.0651326e-05 |
| 2,973 | Parallel In-Situ Data Processing with Speculative Loading | 2014 | SIGMOD | 7.7902322e-05 |
| 3,548 | Adaptive Query Processing on RAW Data | 2014 | VLDB | 6.9859242e-05 |
| 3,882 | Micro Adaptivity in Vectorwise | 2013 | SIGMOD | 6.6690423e-05 |
| 3,891 | Slalom: Coasting Through Raw Data via Adaptive Partitioning and Indexing | 2017 | VLDB | 6.659442e-05 |
| 4,326 | Fast Queries Over Heterogeneous Data Through Engine Customization | 2016 | VLDB | 6.288323e-05 |
| 6,407 | Just-In-Time Data Virtualization: Lightweight Data Management with ViDa | 2015 | CIDR | 5.076547e-05 |
| 7,738 | AFilter: Adaptable XML Filtering with Prefix-Caching and Suffix-Clustering | 2006 | VLDB | 4.6636747e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 9,124 | Dynamic Speculative Optimizations for SQL Compilation in Apache Spark | 2020 | VLDB | 4.391961e-05 |
| 7,427 | Selection Pushdown in Column Stores using Bit Manipulation Instructions | 2023 | SIGMOD | 4.7327406e-05 |
| 9,842 | A four-dimensional Analysis of Partitioned Approximate Filters | 2021 | VLDB | 4.2722447e-05 |
| 4,602 | Accelerating Raw Data Analysis with the ACCORDA Software and Hardware Architecture | 2019 | VLDB | 6.0567387e-05 |
| 2,819 | Mison: A Fast JSON Parser for Data Analytics | 2017 | VLDB | 8.0651326e-05 |
| 5,915 | Runtime-Extensible Parsers | 2025 | CIDR | 5.274713e-05 |
| 2,973 | Parallel In-Situ Data Processing with Speculative Loading | 2014 | SIGMOD | 7.7902322e-05 |
| 3,548 | Adaptive Query Processing on RAW Data | 2014 | VLDB | 6.9859242e-05 |
| 7,360 | ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data | 2020 | VLDB | 4.7525925e-05 |
| 3,437 | Speculative Distributed CSV Data Parsing for Big Data Analytics | 2019 | SIGMOD | 7.0942161e-05 |