Back to papers
RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
Summary: RPT: denoising tuple-to-tuple autoencoder; Transformer encoder-decoder unifies BERT and GPT. Pre-trained, it enables data cleaning, auto-completion, and normalization and annotation, plus few-shot and collaborative ER/IE.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID
- 12315
- Venue
- VLDB
- Year
- 2021
- Pagerank
- 8.9876423e-05
- Overall Rank
- 2,349 | 83.66%
- DOI
-
10.14778/3457390.3457391
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 27 of 27 citing papers.
| Rank |
Citing Paper |
Year |
Venue |
Pagerank |
| 1,541 |
Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes |
2023 |
CIDR |
0.00011456579 |
| 1,643 |
CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex |
2022 |
VLDB |
0.0001104256 |
| 2,517 |
Annotating Columns with Pre-trained Language Models |
2022 |
SIGMOD |
8.6092139e-05 |
| 3,335 |
DeepJoin: Joinable Table Discovery with Pre-trained Language Models |
2023 |
VLDB |
7.2065006e-05 |
| 4,934 |
From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management |
2022 |
VLDB |
5.8198826e-05 |
| 5,449 |
Transformers for Tabular Data Representation: A Tutorial on Models and Applications |
2022 |
VLDB |
5.5008652e-05 |
| 5,509 |
Can Large Language Models Predict Data Correlations from Column Names? |
2023 |
VLDB |
5.4703368e-05 |
| 6,092 |
Observatory: Characterizing Embeddings of Relational Tables |
2024 |
VLDB |
5.2138566e-05 |
| 6,280 |
Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks |
2023 |
VLDB |
5.1290457e-05 |
| 6,389 |
Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs |
2024 |
VLDB |
5.0844009e-05 |
| 6,553 |
How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses |
2024 |
VLDB |
5.0157344e-05 |
| 6,569 |
Domain Adaptation for Deep Entity Resolution |
2022 |
SIGMOD |
5.0065379e-05 |
| 6,737 |
Demonstrating GPT-DB: Generating Query-Specific and Customizable Code for SQL Processing with GPT-4 |
2023 |
VLDB |
4.9457488e-05 |
| 6,800 |
DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models |
2024 |
SIGMOD |
4.9231471e-05 |
| 7,020 |
LLM for Data Management |
2024 |
VLDB |
4.8595728e-05 |
| 8,052 |
Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models |
2024 |
VLDB |
4.5953106e-05 |
| 8,186 |
E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model |
2025 |
VLDB |
4.5651684e-05 |
| 8,208 |
SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions |
2024 |
CIDR |
4.5581306e-05 |
| 8,523 |
Controllable Tabular Data Synthesis Using Diffusion Models |
2024 |
SIGMOD |
4.4937074e-05 |
| 8,743 |
CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning |
2024 |
SIGMOD |
4.456315e-05 |
| 9,348 |
GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models |
2024 |
SIGMOD |
4.3526427e-05 |
| 9,399 |
TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations |
2025 |
VLDB |
4.3441378e-05 |
| 9,476 |
Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents |
2025 |
SIGMOD |
4.3341665e-05 |
| 9,479 |
Data Imputation with Limited Data Redundancy Using Data Lakes |
2025 |
VLDB |
4.3341665e-05 |
| 9,777 |
Data Augmentation for ML-driven Data Preparation and Integration |
2021 |
VLDB |
4.2856106e-05 |
| 10,610 |
Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation |
2025 |
VLDB |
4.1945683e-05 |
| 11,347 |
OpenTFV: An Open Domain Table-Based Fact Verification System |
2022 |
SIGMOD |
4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 24 of 24 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank |
Cited Paper |
Year |
Venue |
Pagerank |
| 192 |
HoloClean: Holistic Data Repairs with Probabilistic Inference |
2017 |
VLDB |
0.00035728858 |
| 221 |
Deep Entity Matching with Pre-Trained Language Models |
2021 |
VLDB |
0.00033121824 |
| 265 |
A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification |
2005 |
SIGMOD |
0.00029763412 |
| 300 |
Deep Learning for Entity Matching: A Design Space Exploration |
2018 |
SIGMOD |
0.00028441466 |
| 513 |
TURL: Table Understanding through Representation Learning |
2021 |
VLDB |
0.00021288342 |
| 555 |
Discovering Denial Constraints |
2013 |
VLDB |
0.00020254908 |
| 712 |
Magellan: Toward Building Entity Matching Management Systems |
2016 |
VLDB |
0.00017732426 |
| 754 |
Distributed Representations of Tuples for Entity Resolution |
2018 |
VLDB |
0.00017117211 |
| 833 |
Guided Data Repair |
2011 |
VLDB |
0.00016138432 |
| 881 |
Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes |
2013 |
SIGMOD |
0.00015661103 |
| 1,159 |
Towards Certain Fixes with Editing Rules and Master Data |
2010 |
VLDB |
0.00013592813 |
| 1,277 |
The Data Civilizer System |
2017 |
CIDR |
0.00012879695 |
| 1,546 |
KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing |
2015 |
SIGMOD |
0.00011446851 |
| 1,612 |
Detecting Data Errors: Where are we and what needs to be done? |
2016 |
VLDB |
0.00011142794 |
| 1,894 |
Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning |
2020 |
VLDB |
0.0001018378 |
| 1,914 |
Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks |
2020 |
SIGMOD |
0.00010109102 |
| 2,209 |
Data Integration: After the Teenage Years |
2017 |
PODS |
9.2868035e-05 |
| 2,460 |
Combining Quantitative and Logical Data Cleaning |
2016 |
VLDB |
8.7617484e-05 |
| 2,968 |
Raha: A Configuration-Free Error Detection System |
2019 |
SIGMOD |
7.7985097e-05 |
| 3,140 |
ZeroER: Entity Resolution using Zero Labeled Examples |
2020 |
SIGMOD |
7.4841763e-05 |
| 3,192 |
Towards Dependable Data Repairing with Fixing Rules |
2014 |
SIGMOD |
7.4095761e-05 |
| 5,192 |
Pattern Functional Dependencies for Data Cleaning |
2020 |
VLDB |
5.6375087e-05 |
| 6,534 |
Automatic Rule Refinement for Information Extraction |
2010 |
VLDB |
5.0244622e-05 |
| 8,517 |
Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace |
2017 |
VLDB |
4.4943871e-05 |
Semantically Similar Papers