Database Paper Browser

Efficient Parallel Set-Similarity Joins Using MapReduce

Summary: Presents a three-stage, end-to-end approach to parallel set-similarity joins on MapReduce: balanced partitioning with minimal replication, memory-aware processing for both self-joins and R-S joins, and out-of-core strategies for when data exceed node memory. Reports speedup and scaleup results on Hadoop. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 4269
Venue: SIGMOD
Year: 2010
Pagerank: 0.00022900171
Overall Rank: 447 | 96.90%
DOI: -

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 44 of 44 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
712 | Magellan: Toward Building Entity Matching Management Systems | 2016 | VLDB | 0.00017732426
818 | Finding Related Tables | 2012 | SIGMOD | 0.00016311524
1,074 | Processing Theta-Joins using MapReduce* | 2011 | SIGMOD | 0.00014260096
1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639
1,308 | Upper and Lower Bounds on the Cost of a Map-Reduce Computation | 2013 | VLDB | 0.00012661651
1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748
1,715 | V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors | 2012 | VLDB | 0.00010803271
1,931 | Efficient Processing of k Nearest Neighbor Joins using MapReduce | 2012 | VLDB | 0.00010040427
2,024 | ATLAS: A Probabilistic Algorithm for High Dimensional Similarity Search | 2011 | SIGMOD | 9.7519678e-05
2,175 | Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services | 2017 | SIGMOD | 9.3644117e-05
2,337 | Efficient Processing of Data Warehousing Queries in a Split Execution Environment | 2011 | SIGMOD | 9.0098186e-05
2,592 | Pass-Join: A Partition-based Method for Similarity Joins | 2012 | VLDB | 8.4795761e-05
2,674 | Minimal MapReduce Algorithms | 2013 | SIGMOD | 8.3328645e-05
2,740 | String Similarity Joins: An Experimental Evaluation | 2014 | VLDB | 8.1980628e-05
3,062 | Efficient Multi-way Theta-Join Processing Using MapReduce | 2012 | VLDB | 7.6343994e-05
3,129 | Scalable Big Graph Processing in MapReduce | 2014 | SIGMOD | 7.5008242e-05
3,141 | ClusterJoin: A Similarity Joins Framework using Map-Reduce | 2014 | VLDB | 7.4829448e-05
3,263 | QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications | 2015 | SIGMOD | 7.3097573e-05
3,459 | An Empirical Evaluation of Set Similarity Join Techniques | 2016 | VLDB | 7.072508e-05
3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 7.0465856e-05
3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05
4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 6.4953612e-05
4,147 | Exploiting MapReduce-based Similarity Joins | 2012 | SIGMOD | 6.4096022e-05
4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05
4,493 | ASTERIX: An Open Source System for "Big Data" Management and Analysis (Demo) | 2012 | VLDB | 6.141595e-05
4,775 | Set Similarity Joins on MapReduce: An Experimental Survey | 2018 | VLDB | 5.9315784e-05
5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05
5,902 | The Communication Complexity of Distributed Set-Joins with Applications to Matrix Multiplication | 2015 | PODS | 5.2796864e-05
5,903 | Building Wavelet Histograms on Large Data in MapReduce | 2012 | VLDB | 5.2791351e-05
6,099 | WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis | 2013 | VLDB | 5.2104516e-05
6,507 | Similarity Join over Array Data | 2016 | SIGMOD | 5.0337166e-05
6,605 | Dima: A Distributed In-Memory Similarity-Based Query Processing System | 2017 | VLDB | 4.9965703e-05
7,153 | Submodularity of Distributed Join Computation | 2018 | SIGMOD | 4.8153963e-05
7,215 | SyncSignature: A Simple, Efficient, Parallelizable Framework for Tree Similarity Joins | 2023 | VLDB | 4.7985991e-05
7,588 | Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases | 2013 | VLDB | 4.7030914e-05
7,668 | Human-in-the-loop Data Integration | 2017 | VLDB | 4.6834075e-05
8,137 | Customizable and Scalable Fuzzy Join for Big Data | 2019 | VLDB | 4.5774794e-05
8,291 | TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection | 2022 | SIGMOD | 4.5435639e-05
9,115 | MapReduce Algorithms for Big Data Analysis | 2012 | VLDB | 4.3932167e-05
9,502 | Streaming Similarity Self-Join | 2016 | VLDB | 4.3341665e-05
9,832 | Balance-Aware Distributed String Similarity-Based Query Processing System | 2019 | VLDB | 4.2751057e-05
10,930 | Similarity Joins of Sparse Features | 2024 | SIGMOD | 4.1945683e-05
11,724 | ZigZag: Supporting Similarity Queries on Vector Space Models | 2018 | SIGMOD | 4.1945683e-05
11,976 | Anti-Combining for MapReduce | 2014 | SIGMOD | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 11 of 11 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers