Database Paper Browser

Spark SQL: Relational Data Processing in Spark

Summary: Spark SQL integrates relational processing into Spark through the DataFrame API, unifying SQL queries with Spark's functional workflow. Catalyst, an extensible optimizer written in Scala, supports composable rules, code generation, JSON schema inference, and federation to external databases. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5022
Venue: SIGMOD
Year: 2015
Pagerank: 0.00061639801
Overall Rank: 66 | 99.55%
DOI: 10.1145/2723372.2742797

Incoming Citations (Sorted by Pagerank)

Showing 50 of 206 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
6,388 | Optimizing Data-intensive Systems in Disaggregated Data Centers with TELEPORT | 2022 | SIGMOD | 5.0851841e-05
6,541 | ConnectorX: Accelerating Data Loading From Databases to Dataframes | 2022 | VLDB | 5.0216945e-05
6,590 | Interactive Demonstration of Probabilistic Predicates | 2018 | SIGMOD | 5.0010949e-05
6,658 | Scalable Querying of Nested Data | 2021 | VLDB | 4.9711629e-05
6,673 | Incorporating Super-Operators in Big-Data Query Optimizers | 2020 | VLDB | 4.966799e-05
6,715 | Shared Foundations: Modernizing Meta's Data Lakehouse | 2023 | CIDR | 4.9509939e-05
6,745 | DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs | 2019 | SIGMOD | 4.9417155e-05
6,759 | AStream: Ad-hoc Shared Stream Processing | 2019 | SIGMOD | 4.9352213e-05
6,784 | SparkR: Scaling R Programs with Spark | 2016 | SIGMOD | 4.9265155e-05
6,871 | Towards General and Efficient Online Tuning for Spark | 2023 | VLDB | 4.8997004e-05
6,993 | Unit Testing Data with Deequ | 2019 | SIGMOD | 4.8693227e-05
7,059 | Adaptive and Robust Query Execution for Lakehouses at Scale | 2024 | VLDB | 4.8477825e-05
7,060 | SquirrelJoin: Network-Aware Distributed Join Processing with Lazy Partitioning | 2017 | VLDB | 4.8465382e-05
7,067 | JetScope: Reliable and Interactive Analytics at Cloud Scale | 2015 | VLDB | 4.8440936e-05
7,207 | Kodiak: Leveraging Materialized Views For Very Low-Latency Analytics Over High-Dimensional Web-Scale Data | 2016 | VLDB | 4.800763e-05
7,237 | CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning | 2017 | VLDB | 4.7928651e-05
7,296 | Multi-Tenant Cloud Data Services: State-of-the-Art, Challenges and Opportunities | 2022 | SIGMOD | 4.7723197e-05
7,387 | Bubble Execution: Resource-aware Reliable Analytics at Cloud Scale | 2018 | VLDB | 4.7438193e-05
7,399 | SmartBench: A Benchmark For Data Management In Smart Spaces | 2020 | VLDB | 4.7410149e-05
7,427 | Selection Pushdown in Column Stores using Bit Manipulation Instructions | 2023 | SIGMOD | 4.7327406e-05
7,534 | Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams | 2022 | VLDB | 4.7180004e-05
7,599 | Quill: Efficient, Transferable, and Rich Analytics at Scale | 2016 | VLDB | 4.7003593e-05
7,704 | ExDRa: Exploratory Data Science on Federated Raw Data | 2021 | SIGMOD | 4.6733838e-05
7,723 | Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded | 2017 | VLDB | 4.6676712e-05
7,818 | A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data | 2017 | VLDB | 4.6434716e-05
7,905 | S2RDF: RDF Querying with SPARQL on Spark | 2016 | VLDB | 4.6211706e-05
7,907 | Petabyte-Scale Row-Level Operations in Data Lakehouses | 2024 | VLDB | 4.6205839e-05
7,925 | Architecting a Query Compiler for Spatial Workloads | 2020 | SIGMOD | 4.6153403e-05
7,953 | Shasta: Interactive Reporting At Scale | 2016 | SIGMOD | 4.613363e-05
8,002 | Pangea: Monolithic Distributed Storage for Data Analytics | 2019 | VLDB | 4.6088289e-05
8,075 | AJoin: Ad-hoc Stream Joins at Scale | 2020 | VLDB | 4.5917655e-05
8,130 | Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald | 2024 | PODS | 4.5784634e-05
8,197 | SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft | 2021 | VLDB | 4.5607121e-05
8,230 | You Say 'What', I Hear 'Where' and 'Why' - (Mis-)Interpreting SQL to Derive Fine-Grained Provenance | 2018 | VLDB | 4.5541444e-05
8,248 | Flare & Lantern: Efficiently Swapping Horses Midstream | 2019 | VLDB | 4.5509332e-05
8,396 | Optimizing Declarative Graph Queries at Large Scale | 2019 | SIGMOD | 4.5276541e-05
8,429 | Handling Environments in a Nested Relational Algebra with Combinators and an Implementation in a Verified Query Compiler | 2017 | SIGMOD | 4.5156925e-05
8,479 | Excalibur: A Virtual Machine for Adaptive Fine-grained JIT-Compiled Query Execution based on VOILA | 2023 | VLDB | 4.5014929e-05
8,506 | New Query Optimization Techniques in the Spark Engine of Azure Synapse | 2022 | VLDB | 4.4957661e-05
8,534 | Translation of Array-Based Loops to Distributed Data-Parallel Programs | 2020 | VLDB | 4.4937074e-05
8,617 | A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning | 2024 | VLDB | 4.4846425e-05
8,645 | Predicate Pushdown for Data Science Pipelines | 2023 | SIGMOD | 4.4772518e-05
8,672 | Optimizing Video Selection LIMIT Queries With Commonsense Knowledge | 2024 | VLDB | 4.4710897e-05
8,758 | Hyperspace: The Indexing Subsystem of Azure Synapse | 2021 | VLDB | 4.456315e-05
8,781 | Accelerate Distributed Joins with Predicate Transfer | 2025 | SIGMOD | 4.4534753e-05
8,980 | HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries | 2021 | SIGMOD | 4.4169807e-05
9,001 | The Power of Nested Parallelism in Big Data Processing – Hitting Three Flies with One Slap – | 2021 | SIGMOD | 4.4107627e-05
9,016 | Making Data Engineering Declarative | 2023 | CIDR | 4.4094312e-05
9,093 | Databricks Lakeguard: Supporting Fine-grained Access Control and Multi-user Capabilities for Apache Spark Workloads | 2025 | SIGMOD | 4.398149e-05
9,124 | Dynamic Speculative Optimizations for SQL Compilation in Apache Spark | 2020 | VLDB | 4.391961e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers