Database Paper Browser

RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Summary: RoadRunner automatically generates wrappers for extracting data from large web sites by analyzing the similarities and differences between HTML pages. Experiments on real-world data-intensive sites demonstrate the feasibility and scalability of this wrapper generation approach. (summarized by gpt-5-nano on Feb 09 2026)
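The idea of deriving a wrapper from page similarities and differences can be illustrated with a small sketch. This is not the paper's matching algorithm, which this listing does not describe; it swaps in a generic sequence alignment from Python's standard `difflib` module. The function names `tokenize` and `infer_template` and the `#FIELD` placeholder are hypothetical, chosen only for illustration. Tokens shared by both pages are treated as template, and tokens that differ are treated as data fields.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split HTML into a flat list of tags and non-empty text chunks."""
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]


def infer_template(page_a, page_b):
    """Align two pages; keep shared tokens, mark mismatches as fields.

    Assumption: a plain diff stands in for a real wrapper-inference
    algorithm, so optional and repeated structures are not handled.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])   # common to both pages: template
        else:
            template.append("#FIELD")   # differs between pages: data slot
    return template


page_a = "<html><b>Title:</b> Databases <i>1998</i></html>"
page_b = "<html><b>Title:</b> Web Wrappers <i>2001</i></html>"
print(infer_template(page_a, page_b))
```

On these two toy pages the shared tags and the `Title:` label survive as template, while the title text and the year each collapse into a `#FIELD` slot.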

Paper ID: 8735
Venue: VLDB
Year: 2001
Pagerank: 0.00020757722
Overall Rank: 533 | 96.30%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 31 of 31 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348
652 | On the Provenance of Non-Answers to Queries over Extracted Data | 2008 | VLDB | 0.00018634477
1,221 | A Web of Concepts | 2009 | PODS | 0.00013219242
1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853
1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871
2,362 | Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax | 2004 | SIGMOD | 8.9582251e-05
2,425 | Instance-based Schema Matching for Web Databases by Domain-specific Query Probing | 2004 | VLDB | 8.8376569e-05
3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05
3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05
3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05
3,724 | Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web | 2005 | CIDR | 6.8173288e-05
3,747 | Context-Aware Wrapping: Synchronized Data Extraction | 2007 | VLDB | 6.7917216e-05
3,921 | On the Complexity of Deriving Schema Mappings from Database Instances | 2008 | PODS | 6.6301252e-05
4,137 | Exploiting Content Redundancy for Web Information Extraction | 2010 | VLDB | 6.4181549e-05
4,440 | Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model | 2009 | SIGMOD | 6.187819e-05
4,707 | Object-level Vertical Search | 2007 | CIDR | 5.9810753e-05
5,609 | Documentum ECI Self-Repairing Wrappers: Performance Analysis | 2006 | SIGMOD | 5.4129892e-05
5,652 | From Information to Knowledge: Harvesting Entities and Relationships from Web Sources | 2010 | PODS | 5.3903671e-05
6,195 | WADaR: Joint Wrapper and Data Repair | 2015 | VLDB | 5.1618114e-05
6,403 | RoadRunner: Automatic Data Extraction from Data-Intensive Web Sites | 2002 | SIGMOD | 5.0797045e-05
6,412 | CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web | 2018 | VLDB | 5.0740036e-05
6,751 | Optimal Schemes for Robust Web Extraction | 2011 | VLDB | 4.939042e-05
6,996 | Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference | 2020 | SIGMOD | 4.8681362e-05
7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05
7,919 | DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web | 2015 | VLDB | 4.616746e-05
8,461 | Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents | 2019 | SIGMOD | 4.5061205e-05
8,632 | Measuring the Structural Similarity of Semistructured Documents Using Entropy | 2007 | VLDB | 4.4803734e-05
12,258 | ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data | 2010 | VLDB | 4.1945683e-05
12,280 | Building Ranked Mashups of Unstructured Sources with Uncertain Information | 2010 | VLDB | 4.1945683e-05
12,525 | Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages | 2006 | VLDB | 4.1945683e-05
12,590 | An Automatic Data Grabber for Large Web Sites | 2004 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 3 of 3 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
385 | NoDoSE - A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents | 1998 | SIGMOD | 0.00024795739
1,919 | Cut and Paste | 1997 | PODS | 0.00010094755
2,204 | To Weave the Web | 1997 | VLDB | 9.2970809e-05

Semantically Similar Papers