Is Min-Wise Hashing Optimal for Summarizing Set Intersection?
Summary: Information-theoretic lower bounds show b-bit minwise hashing is space-optimal (variance within constant factor) for estimating intersections of two predicates, but k-permutation/min-hash schemes deteriorate as m grows. We provide new lower/upper bounds and a novel summary that nearly attains the lower bound for m≥2, asymptotically outperforming k-permutation schemes by Θ(m/log m) and subsampling by Θ(log n_max). (summarized by gpt-5-mini on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
- 1. Rasmus Pagh
- 2. Morten Stöckel
- 3. David P. Woodruff
Incoming Citations (Sorted by Pagerank)
Showing 7 of 7 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 5,361 | Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches | 2018 | VLDB | 5.547935e-05 |
| 6,493 | Joins on Samples: A Theoretical Guide for Practitioners | 2020 | VLDB | 5.0424713e-05 |
| 7,645 | Selectivity Estimation on Streaming Spatio-Textual Data Using Local Correlations | 2015 | VLDB | 4.6896215e-05 |
| 8,901 | On the Feasibility of Forgetting in Data Streams | 2024 | PODS | 4.427232e-05 |
| 9,038 | OmniSketch: Efficient Multi-Dimensional High-Velocity Stream Analytics with Arbitrary Predicates | 2024 | VLDB | 4.4039656e-05 |
| 11,025 | Sampling Methods for Inner Product Sketching | 2024 | VLDB | 4.1945683e-05 |
| 11,168 | Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation | 2023 | PODS | 4.1945683e-05 |
Previous
Page 1 / 1
Next
Outgoing Citations (Sorted by Pagerank)
Showing 4 of 4 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 383 | An Optimal Algorithm for the Distinct Elements Problem | 2010 | PODS | 0.00024820873 |
| 3,991 | Beyond Simple Aggregates: Indexing for Summary Queries | 2011 | PODS | 6.5553055e-05 |
| 5,415 | Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments | 2009 | VLDB | 5.5196338e-05 |
| 5,782 | Information Complexity: a Tutorial | 2010 | PODS | 5.3292726e-05 |
Previous
Page 1 / 1
Next
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 6.263585e-05 |
| 5,220 | Similarity Join Size Estimation using Locality Sensitive Hashing | 2011 | VLDB | 5.6216111e-05 |
| 2,464 | Fast Set Intersection in Memory | 2011 | VLDB | 8.7524354e-05 |
| 5,813 | Space-efficient Substring Occurrence Estimation | 2011 | PODS | 5.3170565e-05 |
| 2,171 | Selectivity Estimation For Boolean Queries | 2000 | PODS | 9.3807165e-05 |
| 11,168 | Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation | 2023 | PODS | 4.1945683e-05 |
| 2,779 | Hashed Samples: Selectivity Estimators For Set Similarity Selection Queries | 2008 | VLDB | 8.1320575e-05 |
| 10,927 | Computing A Well-Representative Summary of Conjunctive Query Results | 2024 | PODS | 4.1945683e-05 |
| 6,244 | Approximate Distinct Counts for Billions of Datasets | 2019 | SIGMOD | 5.139669e-05 |
| 4,873 | Power-Law Based Estimation of Set Similarity Join Size | 2009 | VLDB | 5.8602304e-05 |