Database Paper Browser

Spark SQL: Relational Data Processing in Spark

Summary: Spark SQL integrates relational processing into Spark through the DataFrame API, unifying SQL queries with Spark's functional programming workflow. Catalyst, an extensible optimizer written in Scala, supports composable rules, code generation, JSON schema inference, and query federation to external databases. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5022
Venue: SIGMOD
Year: 2015
Pagerank: 0.00061639801
Overall Rank: 66 | 99.55%
DOI: 10.1145/2723372.2742797

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 50 of 206 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
329 | Accelerating Machine Learning Inference with Probabilistic Predicates | 2018 | SIGMOD | 0.00027249545
544 | Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources | 2018 | SIGMOD | 0.00020521965
736 | AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data | 2020 | VLDB | 0.00017447617
746 | Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores | 2020 | VLDB | 0.00017326979
910 | NeuroCard: One Cardinality Estimator for All Tables | 2021 | VLDB | 0.00015423056
943 | Wander Join: Online Aggregation via Random Walks | 2016 | SIGMOD | 0.00015145883
1,323 | Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters | 2016 | SIGMOD | 0.00012601997
1,369 | Random Sampling over Joins Revisited | 2018 | SIGMOD | 0.00012339777
1,377 | Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics | 2021 | CIDR | 0.00012296941
1,427 | Towards Scalable Dataframe Systems | 2020 | VLDB | 0.0001204248
1,435 | Simba: Efficient In-Memory Spatial Analytics | 2016 | SIGMOD | 0.00012004456
1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533
1,548 | Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark | 2018 | SIGMOD | 0.00011431383
1,574 | Approximate Query Processing: No Silver Bullet | 2017 | SIGMOD | 0.00011287495
1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361
1,750 | Weld: A Common Runtime for High Performance Data Analytics | 2017 | CIDR | 0.00010683647
1,792 | Hybrid Transactional/Analytical Processing: A Survey | 2017 | SIGMOD | 0.00010537893
1,882 | Tuplex: Data Science in Python at Native Code Speed | 2021 | SIGMOD | 0.0001021625
1,943 | Procella: Unifying serving and analytical data at YouTube | 2019 | VLDB | 0.00010012569
2,027 | Titian: Data Provenance Support in Spark | 2016 | VLDB | 9.7437067e-05
2,099 | Axiomatic Foundations and Algorithms for Deciding Semantic Equivalences of SQL Queries | 2018 | VLDB | 9.5479391e-05
2,154 | DIFF: A Relational Interface for Large-Scale Data Explanation | 2019 | VLDB | 9.4208667e-05
2,192 | DITA: Distributed In-Memory Trajectory Analytics | 2018 | SIGMOD | 9.3185895e-05
2,267 | ModelarDB: Modular Model-Based Time Series Management with Spark and Cassandra | 2018 | VLDB | 9.1519895e-05
2,383 | How to Architect a Query Compiler | 2016 | SIGMOD | 8.9294108e-05
2,473 | Photon: A Fast Query Engine for Lakehouse Systems | 2022 | SIGMOD | 8.7237281e-05
2,501 | DBEst: Revisiting Approximate Query Processing Engines with Machine Learning Models | 2019 | SIGMOD | 8.6453446e-05
2,545 | POLARIS: The Distributed SQL Engine in Azure Synapse | 2020 | VLDB | 8.5725413e-05
2,588 | Database Learning: Toward a Database that Becomes Smarter Every Time | 2017 | SIGMOD | 8.4909562e-05
2,700 | Filter Before You Parse: Faster Analytics on Raw Data with Sparser | 2018 | VLDB | 8.2728509e-05
2,762 | FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation | 2021 | VLDB | 8.1585394e-05
2,772 | Quickstep: A Data Platform Based on the Scaling-Up Approach | 2018 | VLDB | 8.1401661e-05
2,819 | Mison: A Fast JSON Parser for Data Analytics | 2017 | VLDB | 8.0651326e-05
2,838 | How to Architect a Query Compiler, Revisited | 2018 | SIGMOD | 8.0408472e-05
2,896 | Evaluating End-to-End Optimization for Data Analytics Applications in Weld | 2018 | VLDB | 7.9452051e-05
2,910 | DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine | 2016 | SIGMOD | 7.9266529e-05
2,919 | RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark | 2019 | SIGMOD | 7.9047279e-05
2,934 | AIDA - Abstraction for Advanced In-Database Analytics | 2018 | VLDB | 7.8595778e-05
2,965 | SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment | 2016 | SIGMOD | 7.8059273e-05
3,023 | Helix: Accelerating Human-in-the-loop Machine Learning | 2018 | VLDB | 7.6929986e-05
3,058 | Rethinking Data-Intensive Science Using Scalable Analytics Systems | 2015 | SIGMOD | 7.6410159e-05
3,152 | AnalyticDB: Real-time OLAP Database System at Alibaba Cloud | 2019 | VLDB | 7.4711766e-05
3,200 | Big Data Analytics with Datalog Queries on Spark | 2016 | SIGMOD | 7.3912411e-05
3,355 | F1 Query: Declarative Querying at Scale | 2018 | VLDB | 7.1829142e-05
3,407 | End-to-end Optimization of Machine Learning Prediction Queries | 2022 | SIGMOD | 7.1295646e-05
3,437 | Speculative Distributed CSV Data Parsing for Big Data Analytics | 2019 | SIGMOD | 7.0942161e-05
3,535 | Scaling Spark in the Real World: Performance and Usability | 2015 | VLDB | 6.9992495e-05
3,550 | Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems | 2018 | VLDB | 6.9843512e-05
3,571 | Lightning Fast and Space Efficient Inequality Joins | 2015 | VLDB | 6.9580858e-05
3,628 | OceanBase: A 707 Million tpmC Distributed Relational Database System | 2022 | VLDB | 6.9031596e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.


Semantically Similar Papers