Extracting Structured Data from Web Pages
Summary: An unsupervised method for extracting structured data from template-generated web pages. Given multiple pages produced from the same template, the algorithm infers the template and how it encodes the data, then outputs the extracted values. The method is validated on large sets of real pages.
(summarized by gpt-5-nano on Feb 09 2026)
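To make the idea in the summary concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it swaps in a much simpler technique (pairwise sequence alignment with the standard-library `difflib`) and assumes that pages are already split into whitespace-separated tokens, and that any token shared by every page, in the same order, belongs to the template. The function names `common_tokens`, `infer_template`, and `extract_values`, and the sample pages, are hypothetical.

```python
from difflib import SequenceMatcher


def common_tokens(a, b):
    """Return the tokens that a and b share, in order (difflib alignment)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def infer_template(pages):
    """Assumption: the template is whatever survives alignment with every page."""
    template = pages[0]
    for page in pages[1:]:
        template = common_tokens(template, page)
    return template


def extract_values(template, page):
    """Collect the runs of page tokens that fall between template tokens.

    Greedy matching: this breaks if a data value happens to equal the
    next expected template token, which real pages can easily do.
    """
    values, current, t = [], [], 0
    for token in page:
        if t < len(template) and token == template[t]:
            if current:
                values.append(" ".join(current))
                current = []
            t += 1
        else:
            current.append(token)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages generated from one template.
pages = [
    "<b> Title: </b> Dune <i> Price: </i> 9.99".split(),
    "<b> Title: </b> Emma <i> Price: </i> 12.50".split(),
    "<b> Title: </b> Ulysses <i> Price: </i> 7.00".split(),
]
template = infer_template(pages)
rows = [extract_values(template, page) for page in pages]
```

With these sample pages, `template` holds the six markup and label tokens and `rows` holds one `[title, price]` pair per page. The sketch ignores optional and repeated fields, which a real template-inference method has to handle.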
- Paper ID: 3444
- Venue: SIGMOD
- Year: 2003
- Pagerank: 0.00019648348
- Overall Rank: 587 | 95.92%
- DOI: not listed
Incoming Non-self Citations Over Time (chart not shown)
Incoming Citations (Sorted by Pagerank)
Showing 29 of 29 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853 |
| 1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871 |
| 2,224 | The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents | 2005 | VLDB | 9.251962e-05 |
| 2,362 | Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax | 2004 | SIGMOD | 8.9582251e-05 |
| 2,425 | Instance-based Schema Matching for Web Databases by Domain-specific Query Probing | 2004 | VLDB | 8.8376569e-05 |
| 2,617 | Extraction and Integration of Partially Overlapping Web Sources | 2013 | VLDB | 8.4462621e-05 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |
| 3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05 |
| 3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05 |
| 3,724 | Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web | 2005 | CIDR | 6.8173288e-05 |
| 3,747 | Context-Aware Wrapping: Synchronized Data Extraction | 2007 | VLDB | 6.7917216e-05 |
| 4,665 | Argonaut: Macrotask Crowdsourcing for Complex Data Processing | 2015 | VLDB | 6.0125329e-05 |
| 4,707 | Object-level Vertical Search | 2007 | CIDR | 5.9810753e-05 |
| 5,652 | From Information to Knowledge: Harvesting Entities and Relationships from Web Sources | 2010 | PODS | 5.3903671e-05 |
| 6,020 | LearnPADS: Automatic Tool Generation from Ad Hoc Data | 2008 | SIGMOD | 5.2415551e-05 |
| 6,135 | Extracting Logical Hierarchical Structure of HTML Documents Based on Headings | 2015 | VLDB | 5.1930114e-05 |
| 6,412 | CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web | 2018 | VLDB | 5.0740036e-05 |
| 6,996 | Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference | 2020 | SIGMOD | 4.8681362e-05 |
| 7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05 |
| 8,088 | PIDS: Attribute Decomposition for Improved Compression and Query Performance in Columnar Storage | 2020 | VLDB | 4.5897316e-05 |
| 8,632 | Measuring the Structural Similarity of Semistructured Documents Using Entropy | 2007 | VLDB | 4.4803734e-05 |
| 9,320 | Powering In-Database Dynamic Model Slicing for Structured Data Analytics | 2024 | VLDB | 4.3556432e-05 |
| 10,126 | Visual Template Inference for Data Extraction from Documents | 2026 | SIGMOD | 4.1945683e-05 |
| 11,543 | Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design | 2020 | CIDR | 4.1945683e-05 |
| 11,673 | Online Template Induction for Machine-Generated Emails | 2019 | VLDB | 4.1945683e-05 |
| 11,706 | Big Data Linkage for Product Specification Pages | 2018 | SIGMOD | 4.1945683e-05 |
| 12,258 | ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data | 2010 | VLDB | 4.1945683e-05 |
| 12,525 | Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages | 2006 | VLDB | 4.1945683e-05 |
| 12,590 | An Automatic Data Grabber for Large Web Sites | 2004 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 4 of 4 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,633 | Schema Extraction for Tabular Data on the Web | 2013 | VLDB | 8.4063569e-05 |
| 9,248 | Web Record Extraction with Invariants | 2023 | VLDB | 4.3690661e-05 |
| 12,525 | Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages | 2006 | VLDB | 4.1945683e-05 |
| 1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853 |
| 6,958 | Computational Aspects of Resilient Data Extraction from Semistructured Sources | 2000 | PODS | 4.8857878e-05 |
| 1,395 | Structured Querying of Web Text: A Technical Challenge | 2007 | CIDR | 0.00012207039 |
| 1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871 |
| 4,137 | Exploiting Content Redundancy for Web Information Extraction | 2010 | VLDB | 6.4181549e-05 |
| 12,590 | An Automatic Data Grabber for Large Web Sites | 2004 | VLDB | 4.1945683e-05 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |