Eliminating Fuzzy Duplicates in Data Warehouses
Summary: The paper addresses the elimination of fuzzy duplicates in data-warehouse dimensional tables, leveraging hierarchies to resolve domain-specific abbreviations. It presents a scalable, high-quality duplicate-elimination algorithm that outperforms generic text-similarity approaches and is validated on real operational data-warehouse datasets. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Citations (Sorted by Pagerank)
Showing 37 of 37 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 8 of 8 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 67 | The Merge/Purge Problem for Large Databases | 1995 | SIGMOD | 0.00061348205 |
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 125 | Approximate String Joins in a Database (Almost) for Free | 2001 | VLDB | 0.00044847972 |
| 150 | Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity | 1998 | SIGMOD | 0.00041055843 |
| 199 | Declarative Data Cleaning: Language, Model, and Algorithms | 2001 | VLDB | 0.00035041015 |
| 303 | Generic Schema Matching with Cupid | 2001 | VLDB | 0.00028301477 |
| 637 | Automatic Segmentation of Text into Structured Records | 2001 | SIGMOD | 0.00018824614 |
| 1,336 | Clustering Categorical Data: An Approach Based on Dynamical Systems | 1998 | VLDB | 0.00012498064 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 4,026 | Flexible String Matching Against Large Databases in Practice | 2004 | VLDB | 6.5169976e-05 |
| 1,345 | Entity Matching: How Similar Is Similar | 2011 | VLDB | 0.00012468408 |
| 4,619 | Crowd-Based Deduplication: An Adaptive Approach | 2015 | SIGMOD | 6.0444854e-05 |
| 3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05 |
| 5,235 | Industry-Scale Duplicate Detection | 2008 | VLDB | 5.6115647e-05 |
| 6,042 | MDedup: Duplicate Detection with Matching Dependencies | 2020 | VLDB | 5.2405269e-05 |
| 7,725 | Data Cleaning in Microsoft SQL Server 2005 | 2005 | SIGMOD | 4.6670883e-05 |
| 936 | Framework for Evaluating Clustering Algorithms in Duplicate Detection | 2009 | VLDB | 0.0001521549 |
| 3,360 | Modeling and Querying Possible Repairs in Duplicate Detection | 2009 | VLDB | 7.1742067e-05 |
| 155 | Robust and Efficient Fuzzy Match for Online Data Cleaning | 2003 | SIGMOD | 0.00040637896 |