Joins on Samples: A Theoretical Guide for Practitioners

Summary: Revisits sample-based joins for approximate query processing (AQP), challenging the view that joining samples is futile and deriving bounds on join estimation in terms of output size and variance. Proposes a sampling scheme that combines Bernoulli and universe sampling with optimal parameters, plus a distributed variant, and validates it on SQL/AQP engines. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 12256
Venue: VLDB
Year: 2020
Pagerank: 5.039683e-05
Overall Rank: 6,481 | 54.96%
DOI: 10.14778/3372721.3372726

Authors

Incoming Citations (Sorted by Pagerank)

Showing 8 of 8 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
3,777	Instance-Optimized Data Layouts for Cloud Analytics Workloads	2021	SIGMOD	6.7713324e-05
3,827	Correlation Sketches for Approximate Join-Correlation Queries	2021	SIGMOD	6.7195959e-05
3,924	A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation	2021	SIGMOD	6.6227223e-05
5,022	Towards Distribution-aware Query Answering in Data Markets	2022	VLDB	5.7479778e-05
8,642	One Size Does Not Fit All: A Bandit-Based Sampler Combination Framework with Theoretical Guarantees	2022	SIGMOD	4.4734993e-05
9,116	Towards Observability for Production Machine Learning Pipelines	2022	VLDB	4.3886184e-05
9,238	PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees	2025	SIGMOD	4.3648789e-05
10,984	Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality	2024	SIGMOD	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 37 of 37 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
14	Online Aggregation	1997	SIGMOD	0.0010813443
18	On Random Sampling over Joins	1999	SIGMOD	0.00092569117
203	Learned Cardinalities: Estimating Correlated Joins with Deep Learning	2019	CIDR	0.00034868567
212	Join Synopses for Approximate Query Answering	1999	SIGMOD	0.00033997204
216	Ripple Joins for Online Aggregation	1999	SIGMOD	0.00033560137
941	Wander Join: Online Aggregation via Random Walks	2016	SIGMOD	0.00015147831
960	Aqua: A Fast Decision Support System Using Approximate Query Answers	1999	VLDB	0.00015031055
1,065	Processing Complex Aggregate Queries over Data Streams	2002	SIGMOD	0.00014344675
1,104	Cardinality Estimation Done Right: Index-Based Join Sampling	2017	CIDR	0.0001398479
1,151	Blink and It's Done: Interactive Queries on Very Large Data	2012	VLDB	0.00013634671
1,161	VerdictDB: Universalizing Approximate Query Processing	2018	SIGMOD	0.00013579831
1,194	Join Size Estimation Subject to Filter Conditions	2015	VLDB	0.00013411666
1,257	Dynamic Sample Selection for Approximate Query Processing	2003	SIGMOD	0.00013002384
1,320	Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters	2016	SIGMOD	0.00012606067
1,331	ICICLES: Self-tuning Samples for Approximate Query Answering	2000	VLDB	0.00012553948
1,372	Random Sampling over Joins Revisited	2018	SIGMOD	0.0001233325
1,451	Online Aggregation for Large MapReduce Jobs	2011	VLDB	0.00011925842
1,727	QuickSel: Quick Selectivity Learning with Mixture Models	2020	SIGMOD	0.00010731889
1,756	Sampling-Based Query Re-Optimization	2016	SIGMOD	0.00010659753
1,790	Effective Use of Block-Level Sampling in Statistics Estimation	2004	SIGMOD	0.00010529479
1,867	Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems	2014	SIGMOD	0.00010264932
2,208	A Scalable Hash Ripple Join Algorithm	2002	SIGMOD	9.2887018e-05
2,253	Vizdom: Interactive Analytics through Pen and Touch	2015	VLDB	9.1886315e-05
2,254	Two-Level Sampling for Join Size Estimation	2017	SIGMOD	9.1871115e-05
2,424	The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing	2014	SIGMOD	8.8415494e-05
2,589	Database Learning: Toward a Database that Becomes Smarter Every Time	2017	SIGMOD	8.4868591e-05
2,779	Hashed Samples: Selectivity Estimators For Set Similarity Selection Queries	2008	VLDB	8.1314377e-05
3,117	Distributed Lock Management with RDMA: Decentralization without Starvation	2018	SIGMOD	7.5400979e-05
3,335	SnappyData: A Unified Cluster for Streaming, Transactions, and Interactive Analytics	2017	CIDR	7.2023806e-05
3,596	Continuous Sampling for Online Aggregation Over Multiple Queries	2010	SIGMOD	6.9342283e-05
3,713	Is Min-Wise Hashing Optimal for Summarizing Set Intersection?	2014	PODS	6.8182579e-05
3,808	Turbo-Charging Estimate Convergence in DBO	2009	VLDB	6.7416988e-05
3,841	I've Seen "Enough": Incrementally Improving Visualizations to Support Rapid Decision Making	2017	VLDB	6.7090738e-05
4,100	A Bi-Level Bernoulli Scheme for Database Sampling	2004	SIGMOD	6.4473679e-05
4,244	A Disk-Based Join With Probabilistic Guarantees	2005	SIGMOD	6.3228453e-05
5,578	CliffGuard: A Principled Framework for Finding Robust Database Designs	2015	SIGMOD	5.4231783e-05
6,402	Approximate Query Engines: Commercial Challenges and Research Opportunities	2017	SIGMOD	5.0725227e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
92	Practical Selectivity Estimation through Adaptive Sampling	1990	SIGMOD	0.00051431888
550	Tracking Join and Self-Join Sizes in Limited Storage	1999	PODS	0.00020346247
1,867	Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems	2014	SIGMOD	0.00010264932
1,254	Fixed-Precision Estimation of Join Selectivity	1993	PODS	0.00013018797
2,254	Two-Level Sampling for Join Size Estimation	2017	SIGMOD	9.1871115e-05
2,583	Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee	2016	SIGMOD	8.4973431e-05
1,372	Random Sampling over Joins Revisited	2018	SIGMOD	0.0001233325
8,964	Reservoir Sampling over Joins	2024	SIGMOD	4.4163852e-05
6,724	Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing	2021	SIGMOD	4.9449472e-05
18	On Random Sampling over Joins	1999	SIGMOD	0.00092569117