Glean: Structured Extractions from Templatic Documents

Summary: Glean extracts structured data from templatic documents and generalizes to unseen layouts. The paper focuses on data management: high-quality ground truth, efficient training-data generation, and rapid tooling. Together these yield a gain of about 5 F1 points on the same model. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 12613
Venue: VLDB
Year: 2021
Pagerank: 4.3648789e-05
Overall Rank: 9,260 | 35.65%
DOI: 10.14778/3447689.3447703

Incoming Citations (Sorted by Pagerank)

Showing 1 of 1 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
10,126	Visual Template Inference for Data Extraction from Documents	2026	SIGMOD	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 7 of 7 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
252	Snorkel: Rapid Training Data Creation with Weak Supervision	2018	VLDB	0.00030532082
1,274	The Data Civilizer System	2017	CIDR	0.00012869297
1,316	Harvesting Relational Tables from Lists on the Web	2009	VLDB	0.00012616422
3,305	Fonduer: Knowledge Base Construction from Richly Formatted Data	2018	SIGMOD	7.2417724e-05
5,698	Dagger: A Data (not code) Debugger	2020	CIDR	5.3669165e-05
7,394	A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces	2011	VLDB	4.7372157e-05
8,456	Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents	2019	SIGMOD	4.5018004e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
9,259	Improving Information Extraction from Visually Rich Documents using Visual Span Representations	2021	VLDB	4.3648789e-05
11,242	Autonomously Computable Information Extraction	2023	VLDB	4.1905499e-05
8,637	Machine Learning for Data Management: Problems and Solutions	2018	SIGMOD	4.4755972e-05
1,088	Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes	2024	VLDB	0.00014158762
11,783	Building Structured Databases of Factual Knowledge from Massive Text Corpora	2017	SIGMOD	4.1905499e-05
586	Extracting Structured Data from Web Pages	2003	SIGMOD	0.0001963091
636	Automatic segmentation of text into structured records	2001	SIGMOD	0.00018815341
11,258	Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages	2023	VLDB	4.1905499e-05
11,393	Blueprint: A Constraint-solving Approach For Document Extraction	2022	VLDB	4.1905499e-05
10,126	Visual Template Inference for Data Extraction from Documents	2026	SIGMOD	4.1905499e-05