String Similarity Joins: An Experimental Evaluation
Summary: A comprehensive survey and standardized experimental evaluation of string similarity join algorithms. The paper classifies the algorithms by their core techniques, compares them across datasets under a unified framework, and offers practical guidance on algorithm selection for data integration and cleansing.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 10940
- Venue: VLDB
- Year: 2014
- Pagerank: 8.1980628e-05
- Overall Rank: 2,740 | 80.94%
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 27 of 27 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,459 | An Empirical Evaluation of Set Similarity Join Techniques | 2016 | VLDB | 7.072508e-05 |
| 4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 6.4953612e-05 |
| 4,250 | Local Similarity Search for Unstructured Text | 2016 | SIGMOD | 6.3241139e-05 |
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 6.263585e-05 |
| 4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05 |
| 4,684 | Approximate String Joins with Abbreviations | 2018 | VLDB | 6.0006406e-05 |
| 4,775 | Set Similarity Joins on MapReduce: An Experimental Survey | 2018 | VLDB | 5.9315784e-05 |
| 5,228 | Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data | 2016 | VLDB | 5.6158315e-05 |
| 5,469 | Learned Cardinality Estimation for Similarity Queries | 2021 | SIGMOD | 5.4898192e-05 |
| 6,074 | Pigeonring: A Principle for Faster Thresholded Similarity Search | 2019 | VLDB | 5.2242306e-05 |
| 6,512 | Trajectory Similarity Measurement: An Efficiency Perspective | 2024 | VLDB | 5.0321577e-05 |
| 6,595 | Trajectory Similarity Join in Spatial Networks | 2017 | VLDB | 4.9993852e-05 |
| 6,605 | Dima: A Distributed In-Memory Similarity-Based Query Processing System | 2017 | VLDB | 4.9965703e-05 |
| 6,726 | A Pivotal Prefix Based Filtering Algorithm for String Similarity Search | 2014 | SIGMOD | 4.9484027e-05 |
| 7,109 | Efficient Similarity Join and Search on Multi-Attribute Data | 2015 | SIGMOD | 4.8292998e-05 |
| 7,237 | CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning | 2017 | VLDB | 4.7928651e-05 |
| 7,416 | MILC: Inverted List Compression in Memory | 2017 | VLDB | 4.7355258e-05 |
| 7,668 | Human-in-the-loop Data Integration | 2017 | VLDB | 4.6834075e-05 |
| 8,093 | Scalable Distributed Inverted List Indexes in Disaggregated Memory | 2024 | SIGMOD | 4.5873721e-05 |
| 9,563 | Towards a Unified Framework for String Similarity Joins | 2019 | VLDB | 4.3254416e-05 |
| 9,832 | Balance-Aware Distributed String Similarity-Based Query Processing System | 2019 | VLDB | 4.2751057e-05 |
| 9,932 | Local Filtering: Improving the Performance of Approximate Queries on String Collections | 2015 | SIGMOD | 4.2500258e-05 |
| 10,706 | Extensible and Robust Evaluation of Similarity Queries | 2025 | VLDB | 4.1945683e-05 |
| 11,087 | Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching | 2024 | VLDB | 4.1945683e-05 |
| 11,305 | TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching | 2023 | VLDB | 4.1945683e-05 |
| 11,724 | ZigZag: Supporting Similarity Queries on Vector Space Models | 2018 | SIGMOD | 4.1945683e-05 |
| 11,788 | CDB: Optimizing Queries with Crowd-Based Selections and Joins | 2017 | SIGMOD | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 18 of 18 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 91 | M-tree: An Efficient Access Method for Similarity Search in Metric Spaces | 1997 | VLDB | 0.0005181666 |
| 125 | Approximate String Joins in a Database (Almost) for Free | 2001 | VLDB | 0.00044847972 |
| 250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988 |
| 266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727 |
| 447 | Efficient Parallel Set-Similarity Joins Using MapReduce | 2010 | SIGMOD | 0.00022900171 |
| 1,234 | Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints | 2008 | VLDB | 0.00013122499 |
| 1,305 | Bayesian Locality Sensitive Hashing for Fast Similarity Search | 2012 | VLDB | 0.00012687101 |
| 1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748 |
| 1,715 | V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors | 2012 | VLDB | 0.00010803271 |
| 2,376 | Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance | 2010 | SIGMOD | 8.9424361e-05 |
| 2,592 | Pass-Join: A Partition-based Method for Similarity Joins | 2012 | VLDB | 8.4795761e-05 |
| 3,774 | Efficient Exact Edit Similarity Query Processing with the Asymmetric Signature Scheme | 2011 | SIGMOD | 6.7757301e-05 |
| 4,216 | Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints | 2010 | VLDB | 6.3521675e-05 |
| 4,873 | Power-Law Based Estimation of Set Similarity Join Size | 2009 | VLDB | 5.8602304e-05 |
| 4,901 | Probabilistic String Similarity Joins | 2010 | SIGMOD | 5.8411648e-05 |
| 5,220 | Similarity Join Size Estimation using Locality Sensitive Hashing | 2011 | VLDB | 5.6216111e-05 |
| 6,726 | A Pivotal Prefix Based Filtering Algorithm for String Similarity Search | 2014 | SIGMOD | 4.9484027e-05 |
| 7,847 | Set Similarity Join on Probabilistic Data | 2010 | VLDB | 4.6365272e-05 |
Semantically Similar Papers