Efficient set joins on similarity predicates
Summary: The paper presents a general, scalable algorithm for set joins on similarity predicates (intersect size, Jaccard coefficient, cosine similarity, edit distance), extending beyond simple containment. It probes an inverted index with staged optimizations, uses memory-efficient partitioning, and compresses the index so the join can run in memory. The approach generalizes to weighted and unweighted partial word overlap. (summarized by gpt-5-nano on Feb 09 2026)
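As a rough illustration of the inverted-index probing idea mentioned in the summary, the sketch below runs a self-join over small in-memory sets. It is not the paper's algorithm: it omits the staged optimizations, partitioning, and index compression, and the function names (`candidate_overlaps`, `overlap_join`, `jaccard_join`) are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def candidate_overlaps(sets):
    """Yield (i, j, overlap) for every pair i < j that shares at least one token.

    Each set is probed against an inverted index of the sets seen so far,
    then added to the index itself (hypothetical helper, illustration only).
    """
    index = defaultdict(list)  # token -> ids of earlier sets containing it
    for j, tokens in enumerate(sets):
        tokens = set(tokens)
        counts = defaultdict(int)  # earlier set id -> number of shared tokens
        for tok in tokens:
            for i in index[tok]:
                counts[i] += 1
        for i in sorted(counts):
            yield i, j, counts[i]
        for tok in tokens:
            index[tok].append(j)


def overlap_join(sets, threshold):
    """Pairs whose intersection size is at least `threshold`."""
    return [(i, j) for i, j, o in candidate_overlaps(sets) if o >= threshold]


def jaccard_join(sets, threshold):
    """Pairs whose Jaccard coefficient is at least `threshold` (threshold > 0)."""
    sizes = [len(set(s)) for s in sets]
    return [
        (i, j)
        for i, j, o in candidate_overlaps(sets)
        if o / (sizes[i] + sizes[j] - o) >= threshold
    ]


records = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}, {"a", "b", "c", "d"}]
print(overlap_join(records, 2))
print(jaccard_join(records, 0.7))
```

Both predicates reuse the same overlap counts, which is one way to read the summary's claim that a single index-probing scheme can serve several similarity measures; only the final filter changes.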
Authors
1. Sunita Sarawagi
2. Alok Kirpal
Incoming Citations (Sorted by Pagerank)
Showing 50 of 53 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 9 of 9 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 36 | Fast Algorithms for Mining Association Rules | 1994 | VLDB | 0.00076161096 |
| 125 | Approximate String Joins in a Database (Almost) for Free | 2001 | VLDB | 0.00044847972 |
| 152 | An Evaluation of Non-Equijoin Algorithms | 1991 | VLDB | 0.00040963225 |
| 155 | Robust and Efficient Fuzzy Match for Online Data Cleaning | 2003 | SIGMOD | 0.00040637896 |
| 280 | Eliminating Fuzzy Duplicates in Data Warehouses | 2002 | VLDB | 0.00029113044 |
| 1,048 | Set Containment Joins: The Good, The Bad and The Ugly | 2000 | VLDB | 0.00014457009 |
| 1,562 | Evaluation of Main Memory Join Algorithms for Joins with Subset Join Predicates | 1997 | VLDB | 0.00011356744 |
| 1,763 | Efficient Processing of Joins on Set-valued Attributes | 2003 | SIGMOD | 0.00010638276 |
| 2,171 | Selectivity Estimation For Boolean Queries | 2000 | PODS | 0.000093807165 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,706 | Extensible and Robust Evaluation of Similarity Queries | 2025 | VLDB | 0.000041945683 |
| 447 | Efficient Parallel Set-Similarity Joins Using MapReduce | 2010 | SIGMOD | 0.00022900171 |
| 8,899 | Fast Approximate Similarity Join in Vector Databases | 2025 | SIGMOD | 0.00004427232 |
| 7,109 | Efficient Similarity Join and Search on Multi-Attribute Data | 2015 | SIGMOD | 0.000048292998 |
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 0.00006263585 |
| 4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 0.000064953612 |
| 1,763 | Efficient Processing of Joins on Set-valued Attributes | 2003 | SIGMOD | 0.00010638276 |
| 266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727 |
| 3,459 | An Empirical Evaluation of Set Similarity Join Techniques | 2016 | VLDB | 0.00007072508 |
| 3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 0.000070465856 |