Database Paper Browser

Robust and Efficient Fuzzy Match for Online Data Cleaning

Summary: Proposes a new similarity function that addresses limitations of existing fuzzy-match metrics for data cleaning, and develops an efficient fuzzy-match algorithm for validating and cleaning incoming tuples against reference tables in real time. The approach is demonstrated on real datasets. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 3442
Venue: SIGMOD
Year: 2003
Pagerank: 0.00040637896
Overall Rank: 155 | 98.93%
DOI: -

[Chart: Incoming Non-self Citations Over Time]

Incoming Citations (Sorted by Pagerank)

Showing 50 of 57 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
149 | Trio: A System for Integrated Management of Data, Accuracy, and Lineage | 2005 | CIDR | 0.00041101118
229 | Reference Reconciliation in Complex Information Spaces | 2005 | SIGMOD | 0.00032242633
250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988
266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727
420 | InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables | 2012 | SIGMOD | 0.00023719065
627 | Management of Probabilistic Data: Foundations and Challenges | 2007 | PODS | 0.00018959005
759 | To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks | 2006 | SIGMOD | 0.00017064615
814 | Entity Resolution: Theory, Practice & Open Challenges | 2012 | VLDB | 0.00016370594
1,159 | Towards Certain Fixes with Editing Rules and Master Data | 2010 | VLDB | 0.00013592813
1,202 | VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams | 2007 | VLDB | 0.00013326298
1,285 | Neighborhood Based Fast Graph Search in Large Networks | 2011 | SIGMOD | 0.00012833377
1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748
1,533 | Example-driven Design of Efficient Record Matching Queries | 2007 | VLDB | 0.00011471971
1,585 | Answering Table Augmentation Queries from Unstructured Lists on the Web | 2009 | VLDB | 0.00011255098
2,193 | Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently | 2008 | SIGMOD | 9.3178557e-05
2,333 | A Platform for Personal Information Management and Integration | 2005 | CIDR | 9.0169986e-05
2,376 | Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance | 2010 | SIGMOD | 8.9424361e-05
2,386 | Leveraging Aggregate Constraints For Deduplication | 2007 | SIGMOD | 8.9231895e-05
2,592 | Pass-Join: A Partition-based Method for Similarity Joins | 2012 | VLDB | 8.4795761e-05
2,823 | Interaction between Record Matching and Data Repairing | 2011 | SIGMOD | 8.0593894e-05
3,226 | Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance | 2007 | VLDB | 7.3433307e-05
3,267 | Benchmarking Declarative Approximate Selection Predicates | 2007 | SIGMOD | 7.3058429e-05
3,328 | Multi-column Substring Matching for Database Schema Translation | 2006 | VLDB | 7.2174278e-05
3,529 | Merging the Results of Approximate Match Operations | 2004 | VLDB | 7.0059524e-05
3,712 | MOMA - A Mapping-based Object Matching System | 2007 | CIDR | 6.823134e-05
4,026 | Flexible String Matching Against Large Databases in Practice | 2004 | VLDB | 6.5169976e-05
4,137 | Exploiting Content Redundancy for Web Information Extraction | 2010 | VLDB | 6.4181549e-05
4,216 | Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints | 2010 | VLDB | 6.3521675e-05
4,438 | Selectivity Estimation for Fuzzy String Predicates in Large Data Sets | 2005 | VLDB | 6.1898903e-05
4,901 | Probabilistic String Similarity Joins | 2010 | SIGMOD | 5.8411648e-05
5,073 | Faerie: Efficient Filtering Algorithms for Approximate Dictionary-based Entity Extraction | 2011 | SIGMOD | 5.7177424e-05
5,179 | SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints | 2017 | VLDB | 5.6428428e-05
5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05
5,486 | Fast Foreign-Key Detection in Microsoft SQL Server PowerPivot for Excel | 2014 | VLDB | 5.4811603e-05
5,794 | Discovering Related Data At Scale | 2021 | VLDB | 5.3245122e-05
5,796 | Finding Frequent Items in Probabilistic Data | 2008 | SIGMOD | 5.3240234e-05
5,869 | Demonstration of Panda: A Weakly Supervised Entity Matching System | 2021 | VLDB | 5.2959029e-05
5,987 | Sampling Cube: A Framework for Statistical OLAP Over Sampling Data | 2008 | SIGMOD | 5.2432535e-05
6,074 | Pigeonring: A Principle for Faster Thresholded Similarity Search | 2019 | VLDB | 5.2242306e-05
6,419 | A Deferred Cleansing Method for RFID Data Analytics | 2006 | VLDB | 5.0690363e-05
6,726 | A Pivotal Prefix Based Filtering Algorithm for String Similarity Search | 2014 | SIGMOD | 4.9484027e-05
7,588 | Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases | 2013 | VLDB | 4.7030914e-05
7,725 | Data Cleaning in Microsoft SQL Server 2005 | 2005 | SIGMOD | 4.6670883e-05
7,777 | Indexing Mixed Types for Approximate Retrieval | 2005 | VLDB | 4.653704e-05
8,007 | A Grammar-based Entity Representation Framework for Data Cleaning | 2009 | SIGMOD | 4.6068018e-05
8,099 | Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching | 2023 | VLDB | 4.5859317e-05
8,137 | Customizable and Scalable Fuzzy Join for Big Data | 2019 | VLDB | 4.5774794e-05
9,274 | Ranking Distributed Probabilistic Data | 2009 | SIGMOD | 4.3646295e-05
9,567 | META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion | 2016 | VLDB | 4.3254416e-05
9,832 | Balance-Aware Distributed String Similarity-Based Query Processing System | 2019 | VLDB | 4.2751057e-05

Outgoing Citations (Sorted by Pagerank)

Showing 5 of 5 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

