Database Paper Browser

Eliminating Fuzzy Duplicates in Data Warehouses

Summary: The paper addresses eliminating fuzzy duplicates in data-warehouse dimensional tables, leveraging the tables' hierarchies to resolve domain-specific abbreviations. It presents a scalable, high-quality duplicate-elimination algorithm that outperforms generic text-similarity approaches and is validated on real operational data-warehouse datasets. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 8877
Venue: VLDB
Year: 2002
Pagerank: 0.00029113044
Overall Rank: 280 | 98.06%
DOI: -

Authors

Incoming Citations (Sorted by Pagerank)

Showing 37 of 37 citing papers.

Rank Citing Paper Year Venue Pagerank
155 Robust and Efficient Fuzzy Match for Online Data Cleaning 2003 SIGMOD 0.00040637896
229 Reference Reconciliation in Complex Information Spaces 2005 SIGMOD 0.00032242633
250 Efficient set joins on similarity predicates 2004 SIGMOD 0.00030661988
266 Efficient Exact Set-Similarity Joins 2006 VLDB 0.00029718727
319 Evaluation of entity resolution approaches on real-world match problems 2010 VLDB 0.00027781866
509 On Active Learning of Record Matching Packages 2010 SIGMOD 0.00021409518
627 Management of Probabilistic Data: Foundations and Challenges 2007 PODS 0.00018959005
702 Reasoning about Record Matching Rules 2009 VLDB 0.00017918203
1,221 A Web of Concepts 2009 PODS 0.00013219242
1,627 Data Cleaning: Overview and Emerging Challenges 2016 SIGMOD 0.00011086905
1,908 Information-Theoretic Tools for Mining Database Structure from Large Data Sets 2004 SIGMOD 0.00010126101
1,970 Approximate Lineage for Probabilistic Databases 2008 VLDB 9.896375e-05
2,386 Leveraging Aggregate Constraints For Deduplication 2007 SIGMOD 8.9231895e-05
2,405 Linking Temporal Records 2011 VLDB 8.8815018e-05
2,589 DogmatiX Tracks down Duplicates in XML 2005 SIGMOD 8.4847146e-05
2,722 Progressive Approach to Relational Entity Resolution 2014 VLDB 8.2338356e-05
3,177 Evaluating Entity Resolution Results 2010 VLDB 7.4367331e-05
3,267 Benchmarking Declarative Approximate Selection Predicates 2007 SIGMOD 7.3058429e-05
3,528 Distributed Data Deduplication 2016 VLDB 7.0066139e-05
3,631 On-the-Fly Entity-Aware Query Processing in the Presence of Linkage 2010 VLDB 6.9014378e-05
3,645 Large-Scale Collective Entity Matching 2011 VLDB 6.8853274e-05
3,712 MOMA - A Mapping-based Object Matching System 2007 CIDR 6.823134e-05
3,838 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters 2006 SIGMOD 6.7134945e-05
4,438 Selectivity Estimation for Fuzzy String Predicates in Large Data Sets 2005 VLDB 6.1898903e-05
5,235 Industry-Scale Duplicate Detection 2008 VLDB 5.6115647e-05
5,586 QuERy: A Framework for Integrating Entity Resolution with Query Processing 2016 VLDB 5.4219548e-05
6,175 Query-Driven Approach to Entity Resolution 2013 VLDB 5.169496e-05
6,690 Parallel Discrepancy Detection and Incremental Detection 2021 VLDB 4.9621556e-05
7,013 Qualitative Data Cleaning 2016 VLDB 4.8619024e-05
7,061 Serving Deep Learning Models with Deduplication from Relational Databases 2022 VLDB 4.8463881e-05
7,777 Indexing Mixed Types for Approximate Retrieval 2005 VLDB 4.653704e-05
9,563 Towards a Unified Framework for String Similarity Joins 2019 VLDB 4.3254416e-05
10,499 Privacy and Accuracy-Aware AI/ML Model Deduplication 2025 SIGMOD 4.1945683e-05
11,054 Enriching Relations with Additional Attributes for ER 2024 VLDB 4.1945683e-05
11,162 Towards Better Bounds for Finding Quasi-Identifiers 2023 PODS 4.1945683e-05
12,048 GRDB: A System for Declarative and Interactive Analysis of Noisy Information Networks 2013 SIGMOD 4.1945683e-05
12,194 Web Scale Taxonomy Cleansing 2011 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 8 of 8 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers

Overall Rank Paper Year Venue Pagerank
4,026 Flexible String Matching Against Large Databases in Practice 2004 VLDB 6.5169976e-05
1,345 Entity Matching: How Similar Is Similar 2011 VLDB 0.00012468408
4,619 Crowd-Based Deduplication: An Adaptive Approach 2015 SIGMOD 6.0444854e-05
3,528 Distributed Data Deduplication 2016 VLDB 7.0066139e-05
5,235 Industry-Scale Duplicate Detection 2008 VLDB 5.6115647e-05
6,042 MDedup: Duplicate Detection with Matching Dependencies 2020 VLDB 5.2405269e-05
7,725 Data Cleaning in Microsoft SQL Server 2005 2005 SIGMOD 4.6670883e-05
936 Framework for Evaluating Clustering Algorithms in Duplicate Detection 2009 VLDB 0.0001521549
3,360 Modeling and Querying Possible Repairs in Duplicate Detection 2009 VLDB 7.1742067e-05
155 Robust and Efficient Fuzzy Match for Online Data Cleaning 2003 SIGMOD 0.00040637896