Similarity Join Size Estimation using Locality Sensitive Hashing
Summary: Introduces LSH-SS, a sampling-based estimator for vector similarity join (VSJ) size that uses Locality-Sensitive Hashing to keep sampling accurate at high similarity thresholds, generalizing set similarity join (SSJ) size estimation to vector representations (e.g., TF-IDF). Empirical results on real datasets show that LSH-SS delivers higher accuracy and lower variance than random sampling and an adapted SSJ baseline across thresholds. (summarized by gpt-5-nano on Feb 09 2026)
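The general idea behind combining LSH with sampling can be illustrated with a small sketch. This is not the paper's LSH-SS algorithm, only a simplified illustration under stated assumptions: a single hash table built from random-hyperplane signatures, cosine similarity as the measure, and a self-join over one collection. Pairs are split into two strata (pairs that share a bucket and pairs that do not), each stratum is sampled uniformly, and the hit rates are scaled by the stratum sizes. All function and parameter names (`lsh_key`, `stratified_estimate`, `k`, `samples_per_stratum`) are hypothetical.

```python
import itertools
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def lsh_key(v, planes):
    """Sign pattern of v against each random hyperplane (assumed LSH family)."""
    return tuple(sum(a * p for a, p in zip(v, plane)) >= 0 for plane in planes)


def exact_join_size(vectors, tau):
    """Brute-force count of unordered pairs with cosine similarity >= tau."""
    return sum(1 for u, v in itertools.combinations(vectors, 2)
               if cosine(u, v) >= tau)


def stratified_estimate(vectors, tau, k=8, samples_per_stratum=200, seed=0):
    """Estimate the self-join size by sampling two LSH-induced strata."""
    rng = random.Random(seed)
    n, dim = len(vectors), len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    keys = [lsh_key(v, planes) for v in vectors]

    buckets = {}
    for i, key in enumerate(keys):
        buckets.setdefault(key, []).append(i)

    # Stratum H: pairs that share a bucket. Stratum L: all remaining pairs.
    multi = [b for b in buckets.values() if len(b) >= 2]
    weights = [len(b) * (len(b) - 1) // 2 for b in multi]
    n_h = sum(weights)
    n_l = n * (n - 1) // 2 - n_h

    estimate = 0.0
    if n_h:
        hits = 0
        for _ in range(samples_per_stratum):
            # Pick a bucket proportionally to its pair count, then a pair in it,
            # which yields a uniform draw over all same-bucket pairs.
            bucket = rng.choices(multi, weights=weights)[0]
            i, j = rng.sample(bucket, 2)
            hits += cosine(vectors[i], vectors[j]) >= tau
        estimate += n_h * hits / samples_per_stratum
    if n_l:
        hits, drawn = 0, 0
        while drawn < samples_per_stratum:
            i, j = rng.sample(range(n), 2)
            if keys[i] == keys[j]:
                continue  # rejection: this pair belongs to stratum H
            drawn += 1
            hits += cosine(vectors[i], vectors[j]) >= tau
        estimate += n_l * hits / samples_per_stratum
    return estimate
```

On a small collection, `stratified_estimate(vectors, tau)` can be compared against `exact_join_size(vectors, tau)`. The intuition is that at high thresholds qualifying pairs are rare among all pairs but comparatively concentrated among same-bucket pairs, so sampling that stratum separately is less likely to miss them than uniform sampling over all pairs.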
Authors
1. Hongrae Lee
2. Raymond T. Ng
3. Kyuseok Shim
Incoming Citations (Sorted by Pagerank)
Showing 5 of 5 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748 |
| 2,740 | String Similarity Joins: An Experimental Evaluation | 2014 | VLDB | 8.1980628e-05 |
| 2,969 | Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models | 2017 | VLDB | 7.7974762e-05 |
| 5,151 | String Similarity Measures and Joins with Synonyms | 2013 | SIGMOD | 5.6609851e-05 |
| 5,469 | Learned Cardinality Estimation for Similarity Queries | 2021 | SIGMOD | 5.4898192e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 10 of 10 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 18 | On Random Sampling over Joins | 1999 | SIGMOD | 0.00092385438 |
| 92 | Practical Selectivity Estimation through Adaptive Sampling | 1990 | SIGMOD | 0.00051315959 |
| 99 | On the Propagation of Errors in the Size of Join Results | 1991 | SIGMOD | 0.00050022914 |
| 250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988 |
| 266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727 |
| 549 | Tracking Join and Self-Join Sizes in Limited Storage | 1999 | PODS | 0.00020376603 |
| 553 | Bifocal Sampling for Skew-Resistant Join Size Estimation | 1996 | SIGMOD | 0.00020272061 |
| 1,255 | Fixed-Precision Estimation of Join Selectivity | 1993 | PODS | 0.00013024064 |
| 2,779 | Hashed Samples: Selectivity Estimators For Set Similarity Selection Queries | 2008 | VLDB | 8.1320575e-05 |
| 4,873 | Power-Law Based Estimation of Set Similarity Join Size | 2009 | VLDB | 5.8602304e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 7.0465856e-05 |
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 6.263585e-05 |
| 6,241 | Scaling Similarity Joins over Tree-Structured Data | 2015 | VLDB | 5.1411469e-05 |
| 8,763 | Smooth Tradeoffs between Insert and Query Complexity in Nearest Neighbor Search | 2015 | PODS | 4.456315e-05 |
| 605 | Locality-Sensitive Hashing Scheme Based on Dynamic Collision Counting | 2012 | SIGMOD | 0.000193396 |
| 4,808 | On the Complexity of Inner Product Similarity Join | 2016 | PODS | 5.908896e-05 |
| 250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988 |
| 2,779 | Hashed Samples: Selectivity Estimators For Set Similarity Selection Queries | 2008 | VLDB | 8.1320575e-05 |
| 4,873 | Power-Law Based Estimation of Set Similarity Join Size | 2009 | VLDB | 5.8602304e-05 |
| 8,899 | Fast Approximate Similarity Join in Vector Databases | 2025 | SIGMOD | 4.427232e-05 |