Database Paper Browser

Deep Entity Matching with Pre-Trained Language Models

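Summary: Ditto treats entity matching as sequence-pair classification with pre-trained Transformer language models, outperforming the previous state of the art by up to 29% F1. It adds three optimizations: highlighting domain-relevant parts of the input, summarizing inputs that exceed the length limit, and augmenting the training data with hard examples. These improve matching quality while requiring fewer labels. On a large-scale task matching two datasets of 789K and 412K records, Ditto reaches 96.5% F1. (summarized by gpt-5-nano on Feb 09 2026)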

Paper ID: 12570
Venue: VLDB
Year: 2021
Pagerank: 0.00033121824
Overall Rank: 221 | 98.47%
DOI: 10.14778/3421424.3421431

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 40 of 90 citing papers.

Rank Citing Paper Year Venue Pagerank
9,235 ThriftLLM: On Cost-Effective Selection of Large Language Models for Classification Queries 2025 VLDB 4.3690661e-05
9,399 TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations 2025 VLDB 4.3441378e-05
9,409 Ground Truth Inference for Weakly Supervised Entity Matching 2023 SIGMOD 4.3441378e-05
9,434 Rock: Cleaning Data by Embedding ML in Logic Rules 2024 SIGMOD 4.3430376e-05
9,460 The Battleship Approach to the Low Resource Entity Matching Problem 2023 SIGMOD 4.3366491e-05
9,461 BrewER: Entity Resolution On-Demand 2023 VLDB 4.3366491e-05
9,479 Data Imputation with Limited Data Redundancy Using Data Lakes 2025 VLDB 4.3341665e-05
9,492 Lingua Manga: A Generic Large Language Model Centric System for Data Curation 2023 VLDB 4.3341665e-05
9,683 Hierarchical Entity Resolution using an Oracle 2022 SIGMOD 4.3047774e-05
9,777 Data Augmentation for ML-driven Data Preparation and Integration 2021 VLDB 4.2856106e-05
9,846 HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs 2025 VLDB 4.2721228e-05
9,847 Discovering Top-k Relevant and Diversified Rules 2024 SIGMOD 4.2721228e-05
9,963 Parallel Rule Discovery from Large Datasets by Sampling 2022 SIGMOD 4.2294678e-05
10,040 3dSAGER: Geospatial Entity Resolution over 3D Objects 2026 SIGMOD 4.1945683e-05
10,091 LLM-Powered Interactive Graph Search: A Scalable and Practical Approach 2026 SIGMOD 4.1945683e-05
10,197 Qualitative Join Discovery in Data Lakes using Examples 2026 SIGMOD 4.1945683e-05
10,443 LLM-Matcher: A Name-Based Schema Matching Tool using Large Language Models 2025 SIGMOD 4.1945683e-05
10,446 MiniClean: A Single-Machine System for Cleaning Big Graphs 2025 SIGMOD 4.1945683e-05
10,486 Rule-Based Graph Cleaning with GPUs on a Single Machine 2025 SIGMOD 4.1945683e-05
10,498 PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models 2025 SIGMOD 4.1945683e-05
10,595 Optimized Batch Prompting for Cost-effective LLMs 2025 VLDB 4.1945683e-05
10,617 Deduplicated Sampling On-Demand 2025 VLDB 4.1945683e-05
10,624 Evaluating Methods for Efficient Entity Count Estimation 2025 VLDB 4.1945683e-05
10,645 OpenForge: Probabilistic Metadata Integration 2025 VLDB 4.1945683e-05
10,723 UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow 2025 VLDB 4.1945683e-05
10,939 Relative Keys: Putting Feature Explanation into Context 2024 SIGMOD 4.1945683e-05
11,006 FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data 2024 VLDB 4.1945683e-05
11,047 Blocker and Matcher Can Mutually Benefit: A Co-Learning Framework for Low-Resource Entity Resolution 2024 VLDB 4.1945683e-05
11,054 Enriching Relations with Additional Attributes for ER 2024 VLDB 4.1945683e-05
11,087 Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching 2024 VLDB 4.1945683e-05
11,117 FairEM360: A Suite for Responsible Entity Matching 2024 VLDB 4.1945683e-05
11,183 Matching Roles from Temporal Data 2023 SIGMOD 4.1945683e-05
11,206 When Automatic Filtering Comes to the Rescue: Pre-Computing Company Competitor Pairs in Owler 2023 SIGMOD 4.1945683e-05
11,223 Splitting Tuples of Mismatched Entities 2023 SIGMOD 4.1945683e-05
11,230 VersaMatch: Ontology Matching with Weak Supervision 2023 VLDB 4.1945683e-05
11,234 Learning and Deducing Temporal Orders 2023 VLDB 4.1945683e-05
11,342 FILA: Online Auditing of Machine Learning Model Accuracy under Finite Labelling Budget 2022 SIGMOD 4.1945683e-05
11,343 SPINE: Scaling up Programming-by-Negative-Example for String Filtering and Transformation 2022 SIGMOD 4.1945683e-05
11,400 CERTEM: Explaining and Debugging Black-box Entity Resolution Systems with CERTA 2022 VLDB 4.1945683e-05
11,515 From Papers to Practice: The openclean Open-Source Data Cleaning Library 2021 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 11 of 11 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers