Spark SQL: Relational Data Processing in Spark

Summary: Spark SQL integrates relational processing into Spark through the DataFrame API, unifying SQL queries with Spark's functional programming workflow. Catalyst, an extensible optimizer written in Scala, supports composable rules, code generation, JSON schema inference, and query federation to external databases. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5022
Venue: SIGMOD
Year: 2015
Pagerank: 0.00061639801
Overall Rank: 66 | 99.55%
DOI: 10.1145/2723372.2742797

Incoming Non-self Citations Over Time

Incoming Citations (Sorted by Pagerank)

Showing 50 of 206 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
6,388	Optimizing Data-intensive Systems in Disaggregated Data Centers with TELEPORT	2022	SIGMOD	5.0851841e-05
6,541	ConnectorX: Accelerating Data Loading From Databases to Dataframes	2022	VLDB	5.0216945e-05
6,590	Interactive Demonstration of Probabilistic Predicates	2018	SIGMOD	5.0010949e-05
6,658	Scalable Querying of Nested Data	2021	VLDB	4.9711629e-05
6,673	Incorporating Super-Operators in Big-Data Query Optimizers	2020	VLDB	4.966799e-05
6,715	Shared Foundations: Modernizing Meta's Data Lakehouse	2023	CIDR	4.9509939e-05
6,745	DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs	2019	SIGMOD	4.9417155e-05
6,759	AStream: Ad-hoc Shared Stream Processing	2019	SIGMOD	4.9352213e-05
6,784	SparkR: Scaling R Programs with Spark	2016	SIGMOD	4.9265155e-05
6,871	Towards General and Efficient Online Tuning for Spark	2023	VLDB	4.8997004e-05
6,993	Unit Testing Data with Deequ	2019	SIGMOD	4.8693227e-05
7,059	Adaptive and Robust Query Execution for Lakehouses at Scale	2024	VLDB	4.8477825e-05
7,060	SquirrelJoin: Network-Aware Distributed Join Processing with Lazy Partitioning	2017	VLDB	4.8465382e-05
7,067	JetScope: Reliable and Interactive Analytics at Cloud Scale	2015	VLDB	4.8440936e-05
7,207	Kodiak: Leveraging Materialized Views For Very Low-Latency Analytics Over High-Dimensional Web-Scale Data	2016	VLDB	4.800763e-05
7,237	CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning	2017	VLDB	4.7928651e-05
7,296	Multi-Tenant Cloud Data Services: State-of-the-Art, Challenges and Opportunities	2022	SIGMOD	4.7723197e-05
7,387	Bubble Execution: Resource-aware Reliable Analytics at Cloud Scale	2018	VLDB	4.7438193e-05
7,399	SmartBench: A Benchmark For Data Management In Smart Spaces	2020	VLDB	4.7410149e-05
7,427	Selection Pushdown in Column Stores using Bit Manipulation Instructions	2023	SIGMOD	4.7327406e-05
7,534	Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams	2022	VLDB	4.7180004e-05
7,599	Quill: Efficient, Transferable, and Rich Analytics at Scale	2016	VLDB	4.7003593e-05
7,704	ExDRa: Exploratory Data Science on Federated Raw Data	2021	SIGMOD	4.6733838e-05
7,723	Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded	2017	VLDB	4.6676712e-05
7,818	A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data	2017	VLDB	4.6434716e-05
7,905	S2RDF: RDF Querying with SPARQL on Spark	2016	VLDB	4.6211706e-05
7,907	Petabyte-Scale Row-Level Operations in Data Lakehouses	2024	VLDB	4.6205839e-05
7,925	Architecting a Query Compiler for Spatial Workloads	2020	SIGMOD	4.6153403e-05
7,953	Shasta: Interactive Reporting At Scale	2016	SIGMOD	4.613363e-05
8,002	Pangea: Monolithic Distributed Storage for Data Analytics	2019	VLDB	4.6088289e-05
8,075	AJoin: Ad-hoc Stream Joins at Scale	2020	VLDB	4.5917655e-05
8,130	Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald	2024	PODS	4.5784634e-05
8,197	SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft	2021	VLDB	4.5607121e-05
8,230	You Say 'What', I Hear 'Where' and 'Why' - (Mis-)Interpreting SQL to Derive Fine-Grained Provenance	2018	VLDB	4.5541444e-05
8,248	Flare & Lantern: Efficiently Swapping Horses Midstream	2019	VLDB	4.5509332e-05
8,396	Optimizing Declarative Graph Queries at Large Scale	2019	SIGMOD	4.5276541e-05
8,429	Handling Environments in a Nested Relational Algebra with Combinators and an Implementation in a Verified Query Compiler	2017	SIGMOD	4.5156925e-05
8,479	Excalibur: A Virtual Machine for Adaptive Fine-grained JIT-Compiled Query Execution based on VOILA	2023	VLDB	4.5014929e-05
8,506	New Query Optimization Techniques in the Spark Engine of Azure Synapse	2022	VLDB	4.4957661e-05
8,534	Translation of Array-Based Loops to Distributed Data-Parallel Programs	2020	VLDB	4.4937074e-05
8,617	A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning	2024	VLDB	4.4846425e-05
8,645	Predicate Pushdown for Data Science Pipelines	2023	SIGMOD	4.4772518e-05
8,672	Optimizing Video Selection LIMIT Queries With Commonsense Knowledge	2024	VLDB	4.4710897e-05
8,758	Hyperspace: The Indexing Subsystem of Azure Synapse	2021	VLDB	4.456315e-05
8,781	Accelerate Distributed Joins with Predicate Transfer	2025	SIGMOD	4.4534753e-05
8,980	HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries	2021	SIGMOD	4.4169807e-05
9,001	The Power of Nested Parallelism in Big Data Processing – Hitting Three Flies with One Slap –	2021	SIGMOD	4.4107627e-05
9,016	Making Data Engineering Declarative	2023	CIDR	4.4094312e-05
9,093	Databricks Lakeguard: Supporting Fine-grained Access Control and Multi-user Capabilities for Apache Spark Workloads	2025	SIGMOD	4.398149e-05
9,124	Dynamic Speculative Optimizations for SQL Compilation in Apache Spark	2020	VLDB	4.391961e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
3	Pig Latin: A Not-So-Foreign Language for Data Processing	2008	SIGMOD	0.0024183614
37	Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud	2012	VLDB	0.0007522744
42	A Comparison of Approaches to Large-Scale Data Analysis	2009	SIGMOD	0.00073498298
109	Dremel: Interactive Analysis of Web-Scale Datasets	2010	VLDB	0.00048186983
132	The EXODUS Optimizer Generator	1987	SIGMOD	0.00042994082
168	MAD Skills: New Analysis Practices for Big Data	2009	VLDB	0.00038946305
476	Impala: A Modern, Open-Source SQL Engine for Hadoop	2015	CIDR	0.00022226941
542	Shark: SQL and Rich Analytics at Scale	2013	SIGMOD	0.00020595648
704	Building Efficient Query Engines in a High-Level Language	2014	VLDB	0.00017900583
1,163	Extracting Schema from Semistructured Data	1998	SIGMOD	0.00013577466
1,721	Distributed Data-Parallel Computing Using a High-Level Programming Language	2009	SIGMOD	0.00010762918
2,001	Sinew: A SQL System for Multi-Structured Data	2014	SIGMOD	9.8186417e-05
2,355	G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data	2015	SIGMOD	8.9677847e-05
2,864	Inferring XML Schema Definitions from XML Data	2007	VLDB	7.9863574e-05
3,058	Rethinking Data-Intensive Science Using Scalable Analytics Systems	2015	SIGMOD	7.6410159e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
3,535	Scaling Spark in the Real World: Performance and Usability	2015	VLDB	6.9992495e-05
542	Shark: SQL and Rich Analytics at Scale	2013	SIGMOD	0.00020595648
557	SystemML: Declarative Machine Learning on Spark	2016	VLDB	0.00020197988
3,200	Big Data Analytics with Datalog Queries on Spark	2016	SIGMOD	7.3912411e-05
11,576	RASQL: A Powerful Language and its System for Big Data Applications	2020	SIGMOD	4.1945683e-05
1,548	Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark	2018	SIGMOD	0.00011431383
6,784	SparkR: Scaling R Programs with Spark	2016	SIGMOD	4.9265155e-05
9,584	Introduction to Spark 2.0 for Database Researchers	2016	SIGMOD	4.3218691e-05
9,124	Dynamic Speculative Optimizations for SQL Compilation in Apache Spark	2020	VLDB	4.391961e-05
2,919	RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark	2019	SIGMOD	7.9047279e-05