The Smallest Extraction Problem

Summary: Introduces landmark grammars, a family of context-free grammars (CFGs) for templated HTML that reduces ambiguity in Web data extraction. Defines the Smallest Extraction Problem (SEP) of learning a grammar from related pages, and presents an unsupervised induction algorithm and an automatic extraction system with improved performance. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 12421
Venue: VLDB
Year: 2021
Pagerank: 4.6372231e-05
Overall Rank: 7,831 | 45.58%
DOI: 10.14778/3476249.3476293

Incoming Citations (Sorted by Pagerank)

Showing 2 of 2 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
9,255	Web Record Extraction with Invariants	2023	VLDB	4.3648789e-05
10,126	Visual Template Inference for Data Extraction from Documents	2026	SIGMOD	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 13 of 13 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
384	NoDoSE - A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents.	1998	SIGMOD	0.00024774707
534	RoadRunner: Towards Automatic Data Extraction from Large Web Sites	2001	VLDB	0.00020739037
586	Extracting Structured Data from Web Pages	2003	SIGMOD	0.0001963091
1,094	The Lixto Data Extraction Project - Back and Forth between Theory and Practice	2004	PODS	0.00014105105
2,607	Extraction and Integration of Partially Overlapping Web Sources	2013	VLDB	8.4615436e-05
4,438	Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model	2009	SIGMOD	6.1819088e-05
5,619	Documentum ECI Self-Repairing Wrappers: Performance Analysis	2006	SIGMOD	5.4077879e-05
6,137	DIADEM: Thousands of Websites to a Single Database	2014	VLDB	5.190481e-05
6,197	WADaR: Joint Wrapper and Data Repair	2015	VLDB	5.1570343e-05
6,407	CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web	2018	VLDB	5.0691335e-05
6,755	Optimal Schemes for Robust Web Extraction	2011	VLDB	4.9343136e-05
6,994	Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference	2020	SIGMOD	4.8634654e-05
9,028	Robust and Noise Resistant Wrapper Induction	2016	SIGMOD	4.4009442e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
3,679	Automatic Wrappers for Large Scale Web Extraction	2011	VLDB	6.8460927e-05
1,398	Structured Querying of Web Text: A Technical Challenge	2007	CIDR	0.00012201166
2,607	Extraction and Integration of Partially Overlapping Web Sources	2013	VLDB	8.4615436e-05
2,011	Record-Boundary Discovery in Web Documents	1999	SIGMOD	9.8032193e-05
4,438	Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model	2009	SIGMOD	6.1819088e-05
2,368	Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax	2004	SIGMOD	8.9496022e-05
3,286	Using the Structure of Web Sites for Automatic Segmentation of Tables	2004	SIGMOD	7.2692403e-05
586	Extracting Structured Data from Web Pages	2003	SIGMOD	0.0001963091
11,258	Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages	2023	VLDB	4.1905499e-05
6,956	Computational Aspects of Resilient Data Extraction from Semistructured Sources	2000	PODS	4.8811374e-05