Database Paper Browser

ClusterJoin: A Similarity Joins Framework using Map-Reduce

Summary: ClusterJoin is a MapReduce framework for scalable similarity joins. It partitions data according to its distribution and routes each record to the partitions relevant to it. Bisector-based candidate filters, combined with sampling-driven load balancing, deliver probabilistic guarantees and robust scalability for high-dimensional data and low thresholds. (summarized by gpt-5-nano on Feb 09 2026)
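
As a rough illustration of the routing idea in the summary, the sketch below assigns a record to its nearest center (its "home" partition) and also sends it to any other partition whose perpendicular bisector lies within the join threshold. This is a minimal sketch under assumptions, not the paper's exact procedure: it assumes Euclidean distance and a fixed list of centers, and the names `route`, `centers`, and `eps` are hypothetical.

```python
import math


def route(record, centers, eps):
    """Return (home, outers) for one record.

    home   : index of the nearest center (assumed partitioning rule).
    outers : indices of other partitions that could hold a record within
             distance eps, according to a bisector test.
    All names and the Euclidean setting are illustrative assumptions.
    """
    dists = [math.dist(record, c) for c in centers]
    home = min(range(len(centers)), key=dists.__getitem__)

    outers = []
    for j, center in enumerate(centers):
        if j == home:
            continue
        gap = math.dist(centers[home], center)
        if gap == 0:
            continue  # duplicate center, nothing new to route to
        # Distance from the record to the hyperplane equidistant from
        # the home center and center j. Any record that is closer to
        # center j lies on the far side of that hyperplane, so it cannot
        # be within eps of this record unless this value is <= eps.
        to_bisector = (dists[j] ** 2 - dists[home] ** 2) / (2 * gap)
        if to_bisector <= eps:
            outers.append(j)
    return home, outers


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0)]
    print(route((4.5, 0.0), centers, eps=1.0))
    print(route((1.0, 0.0), centers, eps=1.0))
```

In a MapReduce setting, a mapper could emit the record once per returned partition index and a reducer could then compare records within each partition. That wiring is likewise an assumption of this sketch.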

Paper ID: 10766
Venue: VLDB
Year: 2014
Pagerank: 7.4829448e-05
Overall Rank: 3,141 | 78.16%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 21 of 21 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639
2,175 | Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services | 2017 | SIGMOD | 9.3644117e-05
3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 7.0465856e-05
3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05
4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05
4,574 | Incremental View Maintenance over Array Data | 2017 | SIGMOD | 6.0738556e-05
4,775 | Set Similarity Joins on MapReduce: An Experimental Survey | 2018 | VLDB | 5.9315784e-05
5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05
6,261 | The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward | 2021 | VLDB | 5.1350714e-05
6,507 | Similarity Join over Array Data | 2016 | SIGMOD | 5.0337166e-05
6,619 | Near-Optimal Distributed Band-Joins through Recursive Partitioning | 2020 | SIGMOD | 4.9910152e-05
6,690 | Parallel Discrepancy Detection and Incremental Detection | 2021 | VLDB | 4.9621556e-05
7,153 | Submodularity of Distributed Join Computation | 2018 | SIGMOD | 4.8153963e-05
7,237 | CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning | 2017 | VLDB | 4.7928651e-05
7,838 | Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes | 2021 | SIGMOD | 4.6377995e-05
8,137 | Customizable and Scalable Fuzzy Join for Big Data | 2019 | VLDB | 4.5774794e-05
8,575 | THERMAL-JOIN: A Scalable Spatial Join for Dynamic Workloads | 2015 | SIGMOD | 4.4928872e-05
10,068 | DiskJoin: Large-scale Vector Similarity Join with SSD | 2026 | SIGMOD | 4.1945683e-05
10,930 | Similarity Joins of Sparse Features | 2024 | SIGMOD | 4.1945683e-05
11,215 | Correlation Joins over Time Series Data Streams Utilizing Complementary Dimension Reduction and Transformation | 2023 | SIGMOD | 4.1945683e-05
11,504 | LES3: Learning-based Exact Set Similarity Search | 2021 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 7 of 7 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers