Database Paper Browser

How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses

Summary: The first systematic empirical study of how categorical duplicates (e.g., "CA" vs. "California") affect ML classification. It contributes a labeled corpus of 1,262 categorical columns and a 16-dataset benchmark spanning five classifiers and five encoders. It finds that logistic regression and similarity encoding are robust to duplicates, while one-hot encoding paired with high-capacity models degrades. It also provides benchmarks and actionable takeaways for AutoML and data preparation. (summarized by gpt-5-mini on Feb 09 2026)
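
To make the summary's contrast concrete, here is a minimal sketch of why duplicates inflate a one-hot encoding and how a string-similarity score can keep spelling variants close together. It is illustrative only and is not the paper's code: the function names (`one_hot`, `char_ngrams`, `ngram_similarity`) and the sample values are hypothetical, and the character 3-gram Jaccard score is an assumed stand-in for whatever similarity measure the benchmark's encoder uses.

```python
def one_hot(values):
    """Give every distinct string its own indicator column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def char_ngrams(s, n=3):
    """Set of character n-grams of the lowercased, space-padded string."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams (assumed similarity measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


# Hypothetical column with two real-world entities written four ways.
states = ["CA", "California", "NY", "New York", "CA"]
categories, rows = one_hot(states)

# One-hot treats "CA" and "California" as unrelated categories.
print(len(categories), "one-hot columns for 2 underlying entities")

# A similarity score keeps spelling variants near each other.
print(ngram_similarity("California", "Californa"))
print(ngram_similarity("California", "New York"))
```

Note that a character n-gram score of this kind helps mostly with typos and casing differences. An abbreviation such as "CA" shares almost no n-grams with "California", so it would still need a separate deduplication step.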

Paper ID: 13383
Venue: VLDB
Year: 2024
Pagerank: 5.0157344e-05
Overall Rank: 6,553 | 54.42%
DOI: 10.14778/3648160.3648178

Incoming Citations (Sorted by Pagerank)

Showing 2 of 2 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 19 of 19 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858
221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824
300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466
517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035
712 | Magellan: Toward Building Entity Matching Management Systems | 2016 | VLDB | 0.00017732426
1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905
1,831 | Synthesizing Entity Matching Rules by Examples | 2018 | VLDB | 0.00010384082
2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05
2,767 | A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching | 2020 | SIGMOD | 8.1513883e-05
3,140 | ZeroER: Entity Resolution using Zero Labeled Examples | 2020 | SIGMOD | 7.4841763e-05
3,478 | Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations | 2018 | VLDB | 7.054159e-05
3,861 | Generating Concise Entity Matching Rules | 2017 | SIGMOD | 6.6878164e-05
4,212 | Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration | 2023 | SIGMOD | 6.3555142e-05
4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05
5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05
5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05
5,929 | ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning | 2016 | SIGMOD | 5.2682177e-05
5,981 | DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python | 2021 | SIGMOD | 5.2448986e-05
7,812 | Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs | 2017 | SIGMOD | 4.6443197e-05

Semantically Similar Papers