Deep Entity Matching with Pre-Trained Language Models

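Summary: Ditto casts entity matching as sequence-pair classification with pre-trained transformer language models, outperforming the previous state of the art by up to 29% F1. Three optimizations (highlighting domain knowledge in the input, summarizing overly long inputs, and augmenting the training data with hard examples) further improve matching quality and reduce the number of labels required. On a task matching two datasets of 789K and 412K records, Ditto reaches 96.5% F1. (summarized by gpt-5-nano on Feb 09 2026)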
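
As a rough illustration of the sequence-pair framing mentioned in the summary, the sketch below flattens two records into a single text input that a transformer classifier could label as match or non-match. The [COL]/[VAL] markers follow the serialization scheme described in the Ditto paper; the function names and the sample records are hypothetical, and the classifier itself is omitted. In practice a tokenizer would normally insert the [CLS] and [SEP] tokens, so writing them out by hand here is only for readability.

```python
from typing import Dict


def serialize(record: Dict[str, str]) -> str:
    """Flatten one record into '[COL] attr [VAL] value' segments."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())


def make_pair(left: Dict[str, str], right: Dict[str, str]) -> str:
    """Join two serialized records into one sequence-pair input string."""
    return f"[CLS] {serialize(left)} [SEP] {serialize(right)} [SEP]"


# Hypothetical records, used only to show the shape of the input.
left = {"title": "iPhone 12", "brand": "Apple"}
right = {"title": "Apple iPhone 12 64GB", "brand": "Apple"}

pair_text = make_pair(left, right)
print(pair_text)
```

A binary classifier fine-tuned on such strings would then output a match or non-match decision for each candidate pair.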

Paper ID: 12570
Venue: VLDB
Year: 2021
Pagerank: 0.00033121824
Overall Rank: 221 | 98.47%
DOI: 10.14778/3421424.3421431

Incoming Non-self Citations Over Time (chart not shown)

Incoming Citations (Sorted by Pagerank)

Showing 40 of 90 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
9,235	ThriftLLM: On Cost-Effective Selection of Large Language Models for Classification Queries	2025	VLDB	4.3690661e-05
9,399	TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations	2025	VLDB	4.3441378e-05
9,409	Ground Truth Inference for Weakly Supervised Entity Matching	2023	SIGMOD	4.3441378e-05
9,434	Rock: Cleaning Data by Embedding ML in Logic Rules	2024	SIGMOD	4.3430376e-05
9,460	The Battleship Approach to the Low Resource Entity Matching Problem	2023	SIGMOD	4.3366491e-05
9,461	BrewER: Entity Resolution On-Demand	2023	VLDB	4.3366491e-05
9,479	Data Imputation with Limited Data Redundancy Using Data Lakes	2025	VLDB	4.3341665e-05
9,492	Lingua Manga: A Generic Large Language Model Centric System for Data Curation	2023	VLDB	4.3341665e-05
9,683	Hierarchical Entity Resolution using an Oracle	2022	SIGMOD	4.3047774e-05
9,777	Data Augmentation for ML-driven Data Preparation and Integration	2021	VLDB	4.2856106e-05
9,846	HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs	2025	VLDB	4.2721228e-05
9,847	Discovering Top-k Relevant and Diversified Rules	2024	SIGMOD	4.2721228e-05
9,963	Parallel Rule Discovery from Large Datasets by Sampling	2022	SIGMOD	4.2294678e-05
10,040	3dSAGER: Geospatial Entity Resolution over 3D Objects	2026	SIGMOD	4.1945683e-05
10,091	LLM-Powered Interactive Graph Search: A Scalable and Practical Approach	2026	SIGMOD	4.1945683e-05
10,197	Qualitative Join Discovery in Data Lakes using Examples	2026	SIGMOD	4.1945683e-05
10,443	LLM-Matcher: A Name-Based Schema Matching Tool using Large Language Models	2025	SIGMOD	4.1945683e-05
10,446	MiniClean: A Single-Machine System for Cleaning Big Graphs	2025	SIGMOD	4.1945683e-05
10,486	Rule-Based Graph Cleaning with GPUs on a Single Machine	2025	SIGMOD	4.1945683e-05
10,498	PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models	2025	SIGMOD	4.1945683e-05
10,595	Optimized Batch Prompting for Cost-effective LLMs	2025	VLDB	4.1945683e-05
10,617	Deduplicated Sampling On-Demand	2025	VLDB	4.1945683e-05
10,624	Evaluating Methods for Efficient Entity Count Estimation	2025	VLDB	4.1945683e-05
10,645	OpenForge: Probabilistic Metadata Integration	2025	VLDB	4.1945683e-05
10,723	UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow	2025	VLDB	4.1945683e-05
10,939	Relative Keys: Putting Feature Explanation into Context	2024	SIGMOD	4.1945683e-05
11,006	FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data	2024	VLDB	4.1945683e-05
11,047	Blocker and Matcher Can Mutually Benefit: A Co-Learning Framework for Low-Resource Entity Resolution	2024	VLDB	4.1945683e-05
11,054	Enriching Relations with Additional Attributes for ER	2024	VLDB	4.1945683e-05
11,087	Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching	2024	VLDB	4.1945683e-05
11,117	FairEM360: A Suite for Responsible Entity Matching	2024	VLDB	4.1945683e-05
11,183	Matching Roles from Temporal Data	2023	SIGMOD	4.1945683e-05
11,206	When Automatic Filtering Comes to the Rescue: Pre-Computing Company Competitor Pairs in Owler	2023	SIGMOD	4.1945683e-05
11,223	Splitting Tuples of Mismatched Entities	2023	SIGMOD	4.1945683e-05
11,230	VersaMatch: Ontology Matching with Weak Supervision	2023	VLDB	4.1945683e-05
11,234	Learning and Deducing Temporal Orders	2023	VLDB	4.1945683e-05
11,342	FILA: Online Auditing of Machine Learning Model Accuracy under Finite Labelling Budget	2022	SIGMOD	4.1945683e-05
11,343	SPINE: Scaling up Programming-by-Negative-Example for String Filtering and Transformation	2022	SIGMOD	4.1945683e-05
11,400	CERTEM: Explaining and Debugging Black-box Entity Resolution Systems with CERTA	2022	VLDB	4.1945683e-05
11,515	From Papers to Practice: The openclean Open-Source Data Cleaning Library	2021	VLDB	4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 11 of 11 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
263	CrowdER: Crowdsourcing Entity Resolution	2012	VLDB	0.00029862413
267	Human-powered Sorts and Joins	2012	VLDB	0.00029690405
300	Deep Learning for Entity Matching: A Design Space Exploration	2018	SIGMOD	0.00028441466
319	Evaluation of entity resolution approaches on real-world match problems	2010	VLDB	0.00027781866
643	Corleone: Hands-Off Crowdsourcing for Entity Matching	2014	SIGMOD	0.00018754451
712	Magellan: Toward Building Entity Matching Management Systems	2016	VLDB	0.00017732426
754	Distributed Representations of Tuples for Entity Resolution	2018	VLDB	0.00017117211
1,345	Entity Matching: How Similar Is Similar	2011	VLDB	0.00012468408
1,831	Synthesizing Entity Matching Rules by Examples	2018	VLDB	0.00010384082
2,767	A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching	2020	SIGMOD	8.1513883e-05
3,582	NADEEF/ER: Generic and Interactive Entity Resolution	2014	SIGMOD	6.9479263e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
2,767	A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching	2020	SIGMOD	8.1513883e-05
3,578	Efficient Approximate Entity Extraction with Edit Distance Constraints	2009	SIGMOD	6.9503858e-05
9,460	The Battleship Approach to the Low Resource Entity Matching Problem	2023	SIGMOD	4.3366491e-05
3,640	Deep Learning for Blocking in Entity Matching: A Design Space Exploration	2021	VLDB	6.8891671e-05
6,569	Domain Adaptation for Deep Entity Resolution	2022	SIGMOD	5.0065379e-05
7,052	Pre-trained Embeddings for Entity Resolution: An Experimental Analysis	2023	VLDB	4.8497453e-05
4,837	Entity Resolution with Hierarchical Graph Attention Networks	2022	SIGMOD	5.8892326e-05
6,711	Analyzing How BERT Performs Entity Matching	2022	VLDB	4.9517546e-05
300	Deep Learning for Entity Matching: A Design Space Exploration	2018	SIGMOD	0.00028441466
5,533	Dual-Objective Fine-Tuning of BERT for Entity Matching	2021	VLDB	5.4544359e-05