Database Paper Browser

Overlap Set Similarity Joins with Theoretical Guarantees

Summary: Overlap Set Similarity Joins with Theoretical Guarantees introduces a size-aware algorithm for the c-overlap join, which finds all pairs of sets that share at least c elements, with running time O(n^{2-1/c} k^{1/(2c)}). It partitions the sets by size into a small group and a large group, processes each group with a method suited to it, and adds heuristics for the small sets plus an optimizer for the size boundary between the groups, yielding strong practical speedups. (summarized by gpt-5-nano on Feb 09 2026)
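
The size-aware idea in the summary can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function name `overlap_join`, the `size_boundary` parameter, and the direct pairwise comparison used for large sets are assumptions made for illustration, and the paper's small-set heuristics and boundary optimizer are left out. The sketch relies only on the fact that two sets share at least c elements exactly when they share a common c-element subset.

```python
from collections import defaultdict
from itertools import combinations


def overlap_join(sets, c, size_boundary):
    """Return all index pairs (i, j), i < j, with |sets[i] & sets[j]| >= c.

    size_boundary is a hypothetical parameter: sets with fewer elements
    are treated as "small", all others as "large".
    """
    sets = [frozenset(s) for s in sets]
    small = [i for i, s in enumerate(sets) if len(s) < size_boundary]
    large = [i for i, s in enumerate(sets) if len(s) >= size_boundary]
    results = set()

    # Large sets: compare each one directly against every other set.
    for i in large:
        for j in range(len(sets)):
            if j != i and len(sets[i] & sets[j]) >= c:
                results.add((min(i, j), max(i, j)))

    # Small sets: enumerate every c-element subset and bucket the sets by it.
    # Two sets in the same bucket share that subset, hence at least c elements.
    buckets = defaultdict(list)
    for i in small:
        for subset in combinations(sorted(sets[i]), c):
            buckets[subset].append(i)
    for members in buckets.values():
        # Indices were appended in increasing order, so a < b here.
        for a, b in combinations(members, 2):
            results.add((a, b))

    return sorted(results)


example = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4, 5, 6}, {7, 8}]
pairs = overlap_join(example, c=2, size_boundary=4)
```

In this example the six-element set is handled as large and the other three as small. The choice of `size_boundary` changes only how much work each branch does, not the result, which is why a boundary optimizer can tune it freely.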

Paper ID: 5468
Venue: SIGMOD
Year: 2018
Pagerank: 6.263585e-05
Overall Rank: 4,353 | 69.72%
DOI: 10.1145/3183713.3183748

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 16 of 16 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05
5,469 | Learned Cardinality Estimation for Similarity Queries | 2021 | SIGMOD | 5.4898192e-05
6,074 | Pigeonring: A Principle for Faster Thresholded Similarity Search | 2019 | VLDB | 5.2242306e-05
6,647 | Fast Join Project Query Evaluation using Matrix Multiplication | 2020 | SIGMOD | 4.9772122e-05
7,635 | Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts | 2021 | SIGMOD | 4.6908858e-05
7,765 | Cache-oblivious High-performance Similarity Join | 2019 | SIGMOD | 4.6572085e-05
8,291 | TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection | 2022 | SIGMOD | 4.5435639e-05
8,910 | R2D2: Reducing Redundancy and Duplication in Data Lakes | 2023 | SIGMOD | 4.427232e-05
8,966 | Output-sensitive Conjunctive Query Evaluation | 2024 | PODS | 4.4193184e-05
9,832 | Balance-Aware Distributed String Similarity-Based Query Processing System | 2019 | VLDB | 4.2751057e-05
9,876 | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | 2023 | SIGMOD | 4.2667743e-05
10,245 | SeDA: Bridging the Gap between Efficient Syntactic and Precise Semantic Search of Similar Passages in Large Text Corpora | 2026 | VLDB | 4.1945683e-05
10,706 | Extensible and Robust Evaluation of Similarity Queries | 2025 | VLDB | 4.1945683e-05
10,951 | Determining the Largest Overlap between Tables | 2024 | SIGMOD | 4.1945683e-05
11,247 | A Two-Level Signature Scheme for Stable Set Similarity Joins | 2023 | VLDB | 4.1945683e-05
11,504 | LES3: Learning-based Exact Set Similarity Search | 2021 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers

Overall Rank | Paper | Year | Venue | Pagerank
10,921 | Optimal (Multiway) Spatial Joins | 2024 | PODS | 4.1945683e-05
11,247 | A Two-Level Signature Scheme for Stable Set Similarity Joins | 2023 | VLDB | 4.1945683e-05
4,873 | Power-Law Based Estimation of Set Similarity Join Size | 2009 | VLDB | 5.8602304e-05
8,899 | Fast Approximate Similarity Join in Vector Databases | 2025 | SIGMOD | 4.427232e-05
2,464 | Fast Set Intersection in Memory | 2011 | VLDB | 8.7524354e-05
3,459 | An Empirical Evaluation of Set Similarity Join Techniques | 2016 | VLDB | 7.072508e-05
250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988
266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727
3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 7.0465856e-05
4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 6.4953612e-05