How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses
Summary: The first systematic empirical study of how categorical duplicates (e.g., "CA" vs. "California") affect ML classification. It contributes a labeled corpus of 1,262 categorical columns and a 16-dataset benchmark spanning five classifiers and five encoders. It finds that logistic regression and similarity encoding are robust to duplicates, whereas one-hot encoding combined with high-capacity models degrades. It also provides benchmarks and actionable takeaways for AutoML and data preparation.
(summarized by gpt-5-mini on Feb 09 2026)
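To make the notion of a categorical duplicate concrete, the following is a minimal sketch, not the paper's benchmark code. It shows how duplicates such as "CA" and "California" inflate a one-hot encoding, and how a deduplication step shrinks it again. The `states` column, the `canonical` mapping, and the helper names `one_hot` and `deduplicate` are all hypothetical and chosen only for illustration.

```python
def one_hot(values):
    """One-hot encode a list of category strings.

    Returns (vocabulary, rows), where each row has one slot per distinct category.
    """
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    rows = []
    for v in values:
        row = [0] * len(vocab)
        row[index[v]] = 1
        rows.append(row)
    return vocab, rows


def deduplicate(values, mapping):
    """Map each value to a canonical form; unknown values pass through unchanged."""
    return [mapping.get(v.strip().lower(), v) for v in values]


# Hypothetical column in which several strings refer to the same real-world entity.
states = ["CA", "California", "NY", "CA", "New York", "california"]

# Hypothetical hand-written mapping; real pipelines would need to derive this somehow.
canonical = {"ca": "CA", "california": "CA", "ny": "NY", "new york": "NY"}

raw_vocab, raw_rows = one_hot(states)
clean_vocab, clean_rows = one_hot(deduplicate(states, canonical))

print(len(raw_vocab), raw_vocab)      # one column per distinct raw string
print(len(clean_vocab), clean_vocab)  # one column per real-world entity
```

In the raw encoding, "CA" and "California" land in different columns, so a model sees them as unrelated categories. After deduplication they share one column.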
- Paper ID: 13383
- Venue: VLDB
- Year: 2024
- Pagerank: 5.0157344e-05
- Overall Rank: 6,553 | 54.42%
- DOI: 10.14778/3648160.3648178
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 19 of 19 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858 |
| 221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824 |
| 300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 712 | Magellan: Toward Building Entity Matching Management Systems | 2016 | VLDB | 0.00017732426 |
| 1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905 |
| 1,831 | Synthesizing Entity Matching Rules by Examples | 2018 | VLDB | 0.00010384082 |
| 2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05 |
| 2,767 | A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching | 2020 | SIGMOD | 8.1513883e-05 |
| 3,140 | ZeroER: Entity Resolution using Zero Labeled Examples | 2020 | SIGMOD | 7.4841763e-05 |
| 3,478 | Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations | 2018 | VLDB | 7.054159e-05 |
| 3,861 | Generating Concise Entity Matching Rules | 2017 | SIGMOD | 6.6878164e-05 |
| 4,212 | Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration | 2023 | SIGMOD | 6.3555142e-05 |
| 4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05 |
| 5,929 | ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning | 2016 | SIGMOD | 5.2682177e-05 |
| 5,981 | DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python | 2021 | SIGMOD | 5.2448986e-05 |
| 7,812 | Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs | 2017 | SIGMOD | 4.6443197e-05 |