Spark SQL: Relational Data Processing in Spark

Summary: Spark SQL integrates relational processing into Spark through the DataFrame API, unifying SQL queries with Spark's functional programming workflow. Catalyst, an extensible optimizer written in Scala, supports composable rules, code generation, JSON schema inference, and query federation to external databases. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5022
Venue: SIGMOD
Year: 2015
Pagerank: 0.00061639801
Overall Rank: 66 | 99.55%
DOI: 10.1145/2723372.2742797

[Chart: Incoming Non-self Citations Over Time]

Authors

Incoming Citations (Sorted by Pagerank)

Showing 6 of 206 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
11,694	An Experimental Evaluation of Garbage Collectors on Big Data Applications	2019	VLDB	4.1945683e-05
11,749	An Authorization Model for Multi-Provider Queries	2018	VLDB	4.1945683e-05
11,753	Effective Temporal Dependence Discovery in Time Series Data	2018	VLDB	4.1945683e-05
11,774	Query Processing Techniques for Big Spatial-Keyword Data	2017	SIGMOD	4.1945683e-05
11,948	Tutorial: SQL-on-Hadoop Systems	2015	VLDB	4.1945683e-05
13,096	Blink Twice - Automatic Workload Pinning and Regression Detection for Versionless Apache Spark using Retries	2025	SIGMOD	-

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
3	Pig Latin: A Not-So-Foreign Language for Data Processing	2008	SIGMOD	0.0024183614
37	Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud	2012	VLDB	0.0007522744
42	A Comparison of Approaches to Large-Scale Data Analysis	2009	SIGMOD	0.00073498298
109	Dremel: Interactive Analysis of Web-Scale Datasets	2010	VLDB	0.00048186983
132	The EXODUS Optimizer Generator	1987	SIGMOD	0.00042994082
168	MAD Skills: New Analysis Practices for Big Data	2009	VLDB	0.00038946305
476	Impala: A Modern, Open-Source SQL Engine for Hadoop	2015	CIDR	0.00022226941
542	Shark: SQL and Rich Analytics at Scale	2013	SIGMOD	0.00020595648
704	Building Efficient Query Engines in a High-Level Language	2014	VLDB	0.00017900583
1,163	Extracting Schema from Semistructured Data	1998	SIGMOD	0.00013577466
1,721	Distributed Data-Parallel Computing Using a High-Level Programming Language	2009	SIGMOD	0.00010762918
2,001	Sinew: A SQL System for Multi-Structured Data	2014	SIGMOD	9.8186417e-05
2,355	G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data	2015	SIGMOD	8.9677847e-05
2,864	Inferring XML Schema Definitions from XML Data	2007	VLDB	7.9863574e-05
3,058	Rethinking Data-Intensive Science Using Scalable Analytics Systems	2015	SIGMOD	7.6410159e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
3,535	Scaling Spark in the Real World: Performance and Usability	2015	VLDB	6.9992495e-05
542	Shark: SQL and Rich Analytics at Scale	2013	SIGMOD	0.00020595648
557	SystemML: Declarative Machine Learning on Spark	2016	VLDB	0.00020197988
3,200	Big Data Analytics with Datalog Queries on Spark	2016	SIGMOD	7.3912411e-05
11,576	RASQL: A Powerful Language and its System for Big Data Applications	2020	SIGMOD	4.1945683e-05
1,548	Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark	2018	SIGMOD	0.00011431383
6,784	SparkR: Scaling R Programs with Spark	2016	SIGMOD	4.9265155e-05
9,584	Introduction to Spark 2.0 for Database Researchers	2016	SIGMOD	4.3218691e-05
9,124	Dynamic Speculative Optimizations for SQL Compilation in Apache Spark	2020	VLDB	4.391961e-05
2,919	RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark	2019	SIGMOD	7.9047279e-05