Visual Template Inference for Data Extraction from Documents
Summary: TWIX infers the latent visual templates behind programmatically generated documents by clustering consistently co-located fields and enforcing alignment constraints (e.g., column/header and key/value alignment), then uses the assembled templates to drive extraction. On 34 datasets, template-driven extraction achieves more than 25% higher precision and recall than Evaporate, Textract, Azure, and GPT-4-Vision, and it scales far better on large corpora (≈520× faster, ≈3,786× cheaper).
(summarized by gpt-5-mini on Feb 11 2026)
- Paper ID: 7436
- Venue: SIGMOD
- Year: 2026
- Pagerank: 4.1945683e-05
- Overall Rank: 10,126 | 29.56%
- DOI: 10.1145/3769840
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 15 of 15 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348 |
| 1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154 |
| 1,221 | A Web of Concepts | 2009 | PODS | 0.00013219242 |
| 1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871 |
| 1,963 | DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing | 2025 | VLDB | 9.929429e-05 |
| 3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05 |
| 3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05 |
| 3,742 | TEGRA: Table Extraction by Global Record Alignment | 2015 | SIGMOD | 6.7966898e-05 |
| 4,440 | Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model | 2009 | SIGMOD | 6.187819e-05 |
| 6,751 | Optimal Schemes for Robust Web Extraction | 2011 | VLDB | 4.939042e-05 |
| 7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05 |
| 8,307 | Automatic Web-Scale Information Extraction | 2012 | SIGMOD | 4.5435639e-05 |
| 9,248 | Web Record Extraction with Invariants | 2023 | VLDB | 4.3690661e-05 |
| 9,252 | Improving Information Extraction from Visually Rich Documents using Visual Span Representations | 2021 | VLDB | 4.3690661e-05 |
| 9,253 | Glean: Structured Extractions from Templatic Documents | 2021 | VLDB | 4.3690661e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 11,240 | Autonomously Computable Information Extraction | 2023 | VLDB | 4.1945683e-05 |
| 13,134 | DocDB: A Database for Unstructured Document Analysis | 2025 | VLDB | - |
| 7,424 | Table Extraction and Understanding for Scientific and Enterprise Applications | 2020 | VLDB | 4.7339251e-05 |
| 10,973 | Unstructured Data Fusion for Schema and Data Extraction | 2024 | SIGMOD | 4.1945683e-05 |
| 12,115 | Just-in-Time Information Extraction using Extraction Views | 2012 | SIGMOD | 4.1945683e-05 |
| 10,438 | Doctopus: A System for Budget-aware Structural Data Extraction from Unstructured Documents | 2025 | SIGMOD | 4.1945683e-05 |
| 9,252 | Improving Information Extraction from Visually Rich Documents using Visual Span Representations | 2021 | VLDB | 4.3690661e-05 |
| 8,461 | Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents | 2019 | SIGMOD | 4.5061205e-05 |
| 11,391 | Blueprint: A Constraint-solving Approach For Document Extraction | 2022 | VLDB | 4.1945683e-05 |
| 9,253 | Glean: Structured Extractions from Templatic Documents | 2021 | VLDB | 4.3690661e-05 |