Database Paper Browser

Deep Entity Matching with Pre-Trained Language Models

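Summary: Ditto casts entity matching as sequence-pair classification with pre-trained transformer language models, outperforming the previous state of the art by up to 29% F1. It adds three optimizations (highlighting domain knowledge in the input, summarizing inputs that are too long, and augmenting the training data with hard examples) that improve accuracy while requiring fewer labels. On a matching task over two datasets of 789k and 412k records, Ditto reaches 96.5% F1. (summarized by gpt-5-nano on Feb 09 2026)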

Paper ID
12570
Venue
VLDB
Year
2021
Pagerank
0.00033121824
Overall Rank
221 | 98.47%
DOI
10.14778/3421424.3421431

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 50 of 90 citing papers.

Rank Citing Paper Year Venue Pagerank
513 TURL: Table Understanding through Representation Learning 2021 VLDB 0.00021288342
517 Can Foundation Models Wrangle Your Data? 2023 VLDB 0.00021169035
1,541 Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes 2023 CIDR 0.00011456579
2,057 From Natural Language Processing to Neural Databases 2021 VLDB 9.6624862e-05
2,349 RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation 2021 VLDB 8.9876423e-05
2,517 Annotating Columns with Pre-trained Language Models 2022 SIGMOD 8.6092139e-05
2,587 Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks 2024 SIGMOD 8.4924618e-05
2,836 Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning 2023 VLDB 8.0443826e-05
3,335 DeepJoin: Joinable Table Discovery with Pre-trained Language Models 2023 VLDB 7.2065006e-05
3,396 Automatic Data Repair: Are We Ready to Deploy? 2024 VLDB 7.1455126e-05
3,640 Deep Learning for Blocking in Entity Matching: A Design Space Exploration 2021 VLDB 6.8891671e-05
3,876 The Design of an LLM-powered Unstructured Analytics System 2025 CIDR 6.6741456e-05
3,942 Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins 2022 VLDB 6.6114622e-05
3,995 How Large Language Models Will Disrupt Data Management 2023 VLDB 6.5513237e-05
4,018 Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching 2023 VLDB 6.5244015e-05
4,212 Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration 2023 SIGMOD 6.3555142e-05
4,355 LargeEA: Aligning Entities for Large-scale Knowledge Graphs 2022 VLDB 6.259483e-05
4,630 Knowledge Graphs 2021: A Data Odyssey 2021 VLDB 6.0348379e-05
4,837 Entity Resolution with Hierarchical Graph Attention Networks 2022 SIGMOD 5.8892326e-05
5,024 Towards Distribution-aware Query Answering in Data Markets 2022 VLDB 5.7535043e-05
5,096 Auto-Transform: Learning-to-Transform by Patterns 2020 VLDB 5.7011825e-05
5,280 Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V 2023 VLDB 5.5896735e-05
5,282 Deep Indexed Active Learning for Matching Heterogeneous Entity Representations 2022 VLDB 5.5864206e-05
5,434 Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples 2021 SIGMOD 5.5045402e-05
5,449 Transformers for Tabular Data Representation: A Tutorial on Models and Applications 2022 VLDB 5.5008652e-05
5,533 Dual-Objective Fine-Tuning of BERT for Entity Matching 2021 VLDB 5.4544359e-05
5,978 Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond 2021 SIGMOD 5.2453012e-05
6,092 Observatory: Characterizing Embeddings of Relational Tables 2024 VLDB 5.2138566e-05
6,408 Explaining Link Prediction Systems based on Knowledge Graph Embeddings 2022 SIGMOD 5.0763482e-05
6,553 How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses 2024 VLDB 5.0157344e-05
6,569 Domain Adaptation for Deep Entity Resolution 2022 SIGMOD 5.0065379e-05
6,711 Analyzing How BERT Performs Entity Matching 2022 VLDB 4.9517546e-05
6,800 DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models 2024 SIGMOD 4.9231471e-05
6,894 TableDC: Deep Clustering for Tabular Data 2025 SIGMOD 4.8925595e-05
7,052 Pre-trained Embeddings for Entity Resolution: An Experimental Analysis 2023 VLDB 4.8497453e-05
8,008 Entity Resolution On-Demand 2022 VLDB 4.6067684e-05
8,099 Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching 2023 VLDB 4.5859317e-05
8,153 Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation 2022 VLDB 4.574554e-05
8,204 ELEET: Efficient Learned Query Execution over Text and Tables 2024 VLDB 4.5594273e-05
8,208 SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions 2024 CIDR 4.5581306e-05
8,384 Consistent and Flexible Selectivity Estimation for High-Dimensional Data 2021 SIGMOD 4.5304673e-05
8,406 DADER: Hands-Off Entity Resolution with Domain Adaptation 2022 VLDB 4.5220083e-05
8,436 A Critical Re-evaluation of Neural Methods for Entity Alignment 2022 VLDB 4.5138915e-05
8,906 Mining Geospatial Relationships from Text 2023 SIGMOD 4.427232e-05
8,908 Deep Active Alignment of Knowledge Graph Entities and Schemata 2023 SIGMOD 4.427232e-05
8,910 R2D2: Reducing Redundancy and Duplication in Data Lakes 2023 SIGMOD 4.427232e-05
8,911 PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching 2023 VLDB 4.427232e-05
8,958 FlexER: Flexible Entity Resolution for Multiple Intents 2023 SIGMOD 4.4210635e-05
9,077 VerifAI: Verified Generative AI 2024 CIDR 4.4010762e-05
9,152 Doctopus: Budget-aware Structural Table Extraction from Unstructured Documents 2025 VLDB 4.3849295e-05

Outgoing Citations (Sorted by Pagerank)

Showing 11 of 11 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers