Database Paper Browser

Extracting Structured Data from Web Pages

Summary: The paper addresses unsupervised extraction of structured data from template-generated web pages by inferring the template and how it encodes the data. Given multiple pages, the algorithm jointly learns the template and outputs the extracted values; it is validated on large sets of real pages. (summarized by gpt-5-nano on Feb 09 2026)
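
To make the idea in the summary concrete, here is a toy sketch of template inference. It is not the paper's algorithm: it only assumes that tokens shared, in order, by every page belong to the template and that whatever falls between them is data. The function names (`common_tokens`, `infer_template`, `extract_values`), the whitespace tokenization, and the sample pages are all hypothetical, chosen for illustration.

```python
from difflib import SequenceMatcher


def common_tokens(a, b):
    """Return the tokens that difflib aligns between a and b, in order."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def infer_template(pages):
    """Toy template: the tokens that survive alignment against every page."""
    template = pages[0]
    for page in pages[1:]:
        template = common_tokens(template, page)
    return template


def extract_values(template, page):
    """Return the runs of page tokens that sit between template tokens."""
    matcher = SequenceMatcher(a=template, b=page, autojunk=False)
    values, pos = [], 0
    for block in matcher.get_matching_blocks():
        if block.b > pos:  # tokens skipped in the page are treated as data
            values.append(" ".join(page[pos:block.b]))
        pos = block.b + block.size
    return values


# Hypothetical pages assumed to come from one template.
pages = [
    "<b> Title: </b> Dune <i> Price: </i> 9.99".split(),
    "<b> Title: </b> Emma <i> Price: </i> 4.50".split(),
    "<b> Title: </b> War and Peace <i> Price: </i> 12.00".split(),
]

template = infer_template(pages)
for page in pages:
    print(extract_values(template, page))
```

A sketch like this breaks down as soon as a data value happens to repeat on every page, or a template token is optional on some pages; handling such cases is exactly what makes the real problem hard.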

Paper ID: 3444
Venue: SIGMOD
Year: 2003
Pagerank: 0.00019648348
Overall Rank: 587 | 95.92%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 29 of 29 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853
1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871
2,224 | The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents | 2005 | VLDB | 9.251962e-05
2,362 | Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax | 2004 | SIGMOD | 8.9582251e-05
2,425 | Instance-based Schema Matching for Web Databases by Domain-specific Query Probing | 2004 | VLDB | 8.8376569e-05
2,617 | Extraction and Integration of Partially Overlapping Web Sources | 2013 | VLDB | 8.4462621e-05
3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05
3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05
3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05
3,724 | Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web | 2005 | CIDR | 6.8173288e-05
3,747 | Context-Aware Wrapping: Synchronized Data Extraction | 2007 | VLDB | 6.7917216e-05
4,665 | Argonaut: Macrotask Crowdsourcing for Complex Data Processing | 2015 | VLDB | 6.0125329e-05
4,707 | Object-level Vertical Search | 2007 | CIDR | 5.9810753e-05
5,652 | From Information to Knowledge: Harvesting Entities and Relationships from Web Sources | 2010 | PODS | 5.3903671e-05
6,020 | LearnPADS: Automatic Tool Generation from Ad Hoc Data | 2008 | SIGMOD | 5.2415551e-05
6,135 | Extracting Logical Hierarchical Structure of HTML Documents Based on Headings | 2015 | VLDB | 5.1930114e-05
6,412 | CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web | 2018 | VLDB | 5.0740036e-05
6,996 | Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference | 2020 | SIGMOD | 4.8681362e-05
7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05
8,088 | PIDS: Attribute Decomposition for Improved Compression and Query Performance in Columnar Storage | 2020 | VLDB | 4.5897316e-05
8,632 | Measuring the Structural Similarity of Semistructured Documents Using Entropy | 2007 | VLDB | 4.4803734e-05
9,320 | Powering In-Database Dynamic Model Slicing for Structured Data Analytics | 2024 | VLDB | 4.3556432e-05
10,126 | Visual Template Inference for Data Extraction from Documents | 2026 | SIGMOD | 4.1945683e-05
11,543 | Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design | 2020 | CIDR | 4.1945683e-05
11,673 | Online Template Induction for Machine-Generated Emails | 2019 | VLDB | 4.1945683e-05
11,706 | Big Data Linkage for Product Specification Pages | 2018 | SIGMOD | 4.1945683e-05
12,258 | ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data | 2010 | VLDB | 4.1945683e-05
12,525 | Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages | 2006 | VLDB | 4.1945683e-05
12,590 | An Automatic Data Grabber for Large Web Sites | 2004 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 4 of 4 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
