Customizable and Scalable Fuzzy Join for Big Data
Summary: A customizable, scalable fuzzy join for big data that uses signatures based on locality-sensitive hashing (LSH) to handle domain-specific data-quality issues such as synonyms and abbreviations. On Azure Databricks Spark, it delivers a >50x speedup over prior scale-out methods, with near-linear scalability in both data size and cluster size. (summarized by gpt-5-nano on Feb 09 2026)
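To make the idea of an LSH-signature-based fuzzy join concrete, here is a minimal single-machine sketch in Python. It is not the paper's algorithm: it assumes MinHash with banding as the LSH scheme, Jaccard similarity over word tokens as the match criterion, and a hypothetical `synonyms` map as the customization hook for abbreviations. All function names and parameters (`tokens`, `lsh_signatures`, `fuzzy_join`, `bands`, `rows`) are illustrative.

```python
import hashlib
from collections import defaultdict


def tokens(text, synonyms=None):
    """Lowercase word tokens; `synonyms` is a hypothetical customization hook."""
    synonyms = synonyms or {}
    return {synonyms.get(t, t) for t in text.lower().split()}


def _hash(seed, token):
    # Deterministic 64-bit hash so results are stable across runs.
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def lsh_signatures(toks, bands=8, rows=2):
    """MinHash values grouped into `bands` signatures of `rows` values each."""
    mh = [min(_hash(seed, t) for t in toks) for seed in range(bands * rows)]
    return [(b, tuple(mh[b * rows:(b + 1) * rows])) for b in range(bands)]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def fuzzy_join(left, right, threshold=0.5, synonyms=None, bands=8, rows=2):
    """Return (left, right) pairs whose token sets have Jaccard >= threshold.

    Records sharing at least one signature become candidates; each candidate
    is then verified exactly, so LSH can miss pairs but never adds false ones.
    """
    index = defaultdict(set)
    right_toks = [tokens(r, synonyms) for r in right]
    for j, rt in enumerate(right_toks):
        if rt:  # skip records with no tokens
            for sig in lsh_signatures(rt, bands, rows):
                index[sig].add(j)

    matches = []
    for l in left:
        lt = tokens(l, synonyms)
        if not lt:
            continue
        candidates = set()
        for sig in lsh_signatures(lt, bands, rows):
            candidates |= index.get(sig, set())
        for j in sorted(candidates):
            if jaccard(lt, right_toks[j]) >= threshold:
                matches.append((l, right[j]))
    return matches


if __name__ == "__main__":
    pairs = fuzzy_join(
        ["Microsoft Corp"],
        ["microsoft corporation", "Apple Inc"],
        synonyms={"corp": "corporation"},
    )
    print(pairs)
```

In a scale-out setting, the signature would typically serve as the key of a distributed equi-join, with verification applied to the joined candidates. That mapping is an assumption of this sketch and is not taken from the summary above.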
Authors
1. Zhimin Chen
2. Yue Wang
3. Vivek Narasayya
4. Surajit Chaudhuri
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,942 | Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins | 2022 | VLDB | 6.6114622e-05 |
| 7,476 | Lachesis: Automatic Partitioning for UDF-Centric Analytics | 2021 | VLDB | 4.7188928e-05 |
| 10,754 | OmniMatch: Joinability Discovery in Data Products | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 12 of 12 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 7,250 | A Scalable and Generic Approach to Range Joins | 2022 | VLDB | 4.78908e-05 |
| 4,901 | Probabilistic String Similarity Joins | 2010 | SIGMOD | 5.8411648e-05 |
| 11,358 | Scaling Equi-Joins | 2022 | SIGMOD | 4.1945683e-05 |
| 11,890 | Let's Rethink Join Optimization in Distributed Systems | 2015 | CIDR | 4.1945683e-05 |
| 1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748 |
| 11,305 | TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching | 2023 | VLDB | 4.1945683e-05 |
| 3,141 | ClusterJoin: A Similarity Joins Framework using Map-Reduce | 2014 | VLDB | 7.4829448e-05 |
| 5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05 |
| 155 | Robust and Efficient Fuzzy Match for Online Data Cleaning | 2003 | SIGMOD | 0.00040637896 |
| 10,930 | Similarity Joins of Sparse Features | 2024 | SIGMOD | 4.1945683e-05 |