Robust and Efficient Fuzzy Match for Online Data Cleaning
Summary: Proposes a novel similarity function addressing limitations of fuzzy-match metrics for data cleaning. Develops an efficient fuzzy-match algorithm for real-time validation/cleansing of incoming tuples against reference tables; demonstrated on real datasets. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Surajit Chaudhuri
2. Kris Ganjam
3. Venkatesh Ganti
4. Rajeev Motwani
Incoming Citations (Sorted by Pagerank)
Showing 7 of 57 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,216 | The Case For Language Model Approximated LIKE Predicate | 2026 | SIGMOD | 4.1945683e-05 |
| 11,162 | Towards Better Bounds for Finding Quasi-Identifiers | 2023 | PODS | 4.1945683e-05 |
| 11,507 | TQEL: Framework for Query-Driven Linking of Top-K Entities in Social Media Blogs | 2021 | VLDB | 4.1945683e-05 |
| 12,371 | Building a Global Location Search Service | 2008 | SIGMOD | 4.1945683e-05 |
| 12,461 | Bridging the Application and DBMS Profiling Divide for Database Application Developers | 2007 | VLDB | 4.1945683e-05 |
| 12,478 | Randomized Algorithms for Data Reconciliation in Wide Area Aggregate Query Processing | 2007 | VLDB | 4.1945683e-05 |
| 12,544 | SPIDER: Flexible Matching in Databases | 2005 | SIGMOD | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 5 of 5 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 67 | The Merge/Purge Problem for Large Databases | 1995 | SIGMOD | 0.00061348205 |
| 91 | M-tree: An Efficient Access Method for Similarity Search in Metric Spaces | 1997 | VLDB | 0.0005181666 |
| 125 | Approximate String Joins in a Database (Almost) for Free | 2001 | VLDB | 0.00044847972 |
| 150 | Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity | 1998 | SIGMOD | 0.00041055843 |
| 280 | Eliminating Fuzzy Duplicates in Data Warehouses | 2002 | VLDB | 0.00029113044 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727 |
| 1,533 | Example-driven Design of Efficient Record Matching Queries | 2007 | VLDB | 0.00011471971 |
| 11,305 | TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching | 2023 | VLDB | 4.1945683e-05 |
| 7,725 | Data Cleaning in Microsoft SQL Server 2005 | 2005 | SIGMOD | 4.6670883e-05 |
| 5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05 |
| 4,435 | Sampling Dirty Data for Matching Attributes | 2010 | SIGMOD | 6.1918164e-05 |
| 1,345 | Entity Matching: How Similar Is Similar | 2011 | VLDB | 0.00012468408 |
| 3,529 | Merging the Results of Approximate Match Operations | 2004 | VLDB | 7.0059524e-05 |
| 4,026 | Flexible String Matching Against Large Databases in Practice | 2004 | VLDB | 6.5169976e-05 |
| 280 | Eliminating Fuzzy Duplicates in Data Warehouses | 2002 | VLDB | 0.00029113044 |