Annotating Columns with Pre-trained Language Models

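Summary: Doduo is a multi-task learning framework built on pre-trained language models that takes an entire table as input and predicts both column types and the relations between columns. It achieves state-of-the-art results on two benchmarks, improving column type prediction by up to 4.0% and column relation prediction by up to 11.9%, and it already outperforms the previous state of the art when given only 8 tokens per column. (summarized by gpt-5-nano on Feb 09 2026)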
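
To make the "8 tokens per column" budget concrete, the following is a minimal sketch of column-wise table serialization under a per-column token limit. It is an illustration only, not the authors' code: the function name `serialize_table`, the `tokens_per_column` parameter, and the use of whitespace splitting in place of a real subword tokenizer are all assumptions made for this example.

```python
from typing import Dict, List


def serialize_table(columns: Dict[str, List[str]], tokens_per_column: int = 8) -> List[str]:
    """Flatten a table column by column into a single token sequence.

    Hypothetical helper: each column contributes a [CLS] marker followed by
    at most `tokens_per_column` tokens taken from its cell values.
    """
    sequence: List[str] = []
    for values in columns.values():
        # Assumption: whitespace splitting stands in for a subword tokenizer.
        tokens = " ".join(values).split()
        sequence.append("[CLS]")  # one marker per column
        sequence.extend(tokens[:tokens_per_column])  # enforce the per-column budget
    sequence.append("[SEP]")  # close the serialized table
    return sequence


# Toy table with made-up values; column names are not serialized here.
table = {
    "name": ["George Miller", "Peter Weir"],
    "country": ["Australia", "Australia"],
}
tokens = serialize_table(table, tokens_per_column=3)
print(tokens)
```

In a model of this kind, the representation at each per-column marker could feed a column type classifier, and pairs of markers a relation classifier; that wiring is omitted from the sketch.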

Paper ID: 6359
Venue: SIGMOD
Year: 2022
Pagerank: 8.6155767e-05
Overall Rank: 2,513 | 82.54%
DOI: 10.1145/3514221.3517906

Incoming Citations (Sorted by Pagerank)

Showing 23 of 23 citing papers.

Overall Rank	Citing Paper	Year	Venue	Pagerank
2,585	Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks	2024	SIGMOD	8.4909917e-05
2,842	Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning	2023	VLDB	8.0366354e-05
3,001	SANTOS: Relationship-based Semantic Table Union Search	2023	SIGMOD	7.739698e-05
3,003	Chorus: Foundation Models for Unified Data Discovery and Exploration	2024	VLDB	7.7358219e-05
3,326	DeepJoin: Joinable Table Discovery with Pre-trained Language Models	2023	VLDB	7.2148323e-05
3,639	The Design of an LLM-powered Unstructured Analytics System	2025	CIDR	6.8886648e-05
4,860	Integrating Data Lake Tables	2023	VLDB	5.867964e-05
5,001	GenRewrite: Query Rewriting via Large Language Models	2026	SIGMOD	5.7634197e-05
5,098	ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models	2024	VLDB	5.6943033e-05
5,457	Transformers for Tabular Data Representation: A Tutorial on Models and Applications	2022	VLDB	5.4960654e-05
6,071	Observatory: Characterizing Embeddings of Relational Tables	2024	VLDB	5.2231739e-05
7,027	Mind the Data Gap: Bridging LLMs to Enterprise Data Integration	2025	CIDR	4.8524216e-05
7,045	Magneto: Combining Small and Large Language Models for Schema Matching	2025	VLDB	4.8474104e-05
8,576	RECA: Related Tables Enhanced Column Semantic Type Annotation Framework	2023	VLDB	4.4879383e-05
8,852	Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation	2023	SIGMOD	4.4313992e-05
10,109	Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations	2026	SIGMOD	4.1905499e-05
10,142	AutoDDG: Automated Dataset Description Generation using Large Language Models	2026	SIGMOD	4.1905499e-05
10,507	PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models	2025	SIGMOD	4.1905499e-05
10,519	Table Overlap Estimation through Graph Embeddings	2025	SIGMOD	4.1905499e-05
10,521	Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables	2025	SIGMOD	4.1905499e-05
10,666	LLMLog: Advanced Log Template Generation via LLM-driven Multi-Round Annotation	2025	VLDB	4.1905499e-05
10,759	Cents: A Flexible and Cost-Effective Framework for LLM-Based Table Understanding	2025	VLDB	4.1905499e-05
11,207	Steered Training Data Generation for Learned Semantic Type Detection	2023	SIGMOD	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 14 of 14 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Overall Rank	Cited Paper	Year	Venue	Pagerank
62	Freebase: A Collaboratively Created Graph Database For Structuring Human Knowledge	2008	SIGMOD	0.00064239035
219	Deep Entity Matching with Pre-Trained Language Models	2021	VLDB	0.00033354456
365	Annotating and Searching Web Tables Using Entities, Types and Relationships	2010	VLDB	0.00025616694
383	COMA - A system for flexible combination of schema matching approaches	2002	VLDB	0.00024802484
514	TURL: Table Understanding through Representation Learning	2021	VLDB	0.00021280726
1,005	Recovering Semantics of Tables on the Web	2011	VLDB	0.00014694038
1,274	The Data Civilizer System	2017	CIDR	0.00012869297
1,481	Automating Large-Scale Data Quality Verification	2018	VLDB	0.00011715754
1,914	Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks	2020	SIGMOD	0.00010111859
2,348	RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation	2021	VLDB	8.9903659e-05
2,737	Open Data Integration	2018	VLDB	8.2053894e-05
2,895	Sato: Contextual Semantic Type Detection in Tables	2020	VLDB	7.9539265e-05
3,826	Automatic Discovery of Attributes in Relational Databases	2011	SIGMOD	6.7204879e-05
5,538	Data-Driven Domain Discovery for Structured Datasets	2020	VLDB	5.4520759e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
5,457	Transformers for Tabular Data Representation: A Tutorial on Models and Applications	2022	VLDB	5.4960654e-05
8,892	Generation of Training Examples for Tabular Natural Language Inference	2023	SIGMOD	4.4233018e-05
365	Annotating and Searching Web Tables Using Entities, Types and Relationships	2010	VLDB	0.00025616694
8,915	Making Table Understanding Work in Practice	2022	CIDR	4.4229886e-05
5,098	ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models	2024	VLDB	5.6943033e-05
3,326	DeepJoin: Joinable Table Discovery with Pre-trained Language Models	2023	VLDB	7.2148323e-05
8,852	Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation	2023	SIGMOD	4.4313992e-05
10,109	Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations	2026	SIGMOD	4.1905499e-05
5,506	Can Large Language Models Predict Data Correlations from Column Names?	2023	VLDB	5.4711611e-05
6,798	DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models	2024	SIGMOD	4.9186164e-05