Set Similarity Joins on MapReduce: An Experimental Survey
Summary: An experimental survey of ten distributed set similarity join algorithms for MapReduce. Uniform benchmarking across 12 datasets shows that no algorithm scales universally: long sets, frequent elements, or low similarity thresholds degrade performance. The survey analyzes the root causes and suggests future directions. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
Incoming Citations (Sorted by Pagerank)
Showing 7 of 7 citing papers.
| Overall Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 5,794 | Discovering Related Data At Scale | 2021 | VLDB | 5.3245122e-05 |
| 6,619 | Near-Optimal Distributed Band-Joins through Recursive Partitioning | 2020 | SIGMOD | 4.9910152e-05 |
| 7,765 | Cache-oblivious High-performance Similarity Join | 2019 | SIGMOD | 4.6572085e-05 |
| 8,137 | Customizable and Scalable Fuzzy Join for Big Data | 2019 | VLDB | 4.5774794e-05 |
| 10,930 | Similarity Joins of Sparse Features | 2024 | SIGMOD | 4.1945683e-05 |
| 11,305 | TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching | 2023 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 12 of 12 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 7.0465856e-05 |
| 15 | Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters | 2007 | SIGMOD | 0.0010654262 |
| 6,507 | Similarity Join over Array Data | 2016 | SIGMOD | 5.0337166e-05 |
| 250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988 |
| 266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727 |
| 960 | A Comparison of Join Algorithms for Log Processing in MapReduce | 2010 | SIGMOD | 0.00015012242 |
| 3,459 | An Empirical Evaluation of Set Similarity Join Techniques | 2016 | VLDB | 7.072508e-05 |
| 4,147 | Exploiting MapReduce-based Similarity Joins | 2012 | SIGMOD | 6.4096022e-05 |
| 3,141 | ClusterJoin: A Similarity Joins Framework using Map-Reduce | 2014 | VLDB | 7.4829448e-05 |
| 447 | Efficient Parallel Set-Similarity Joins Using MapReduce | 2010 | SIGMOD | 0.00022900171 |