Database Paper Browser

Harvesting Relational Tables from Lists on the Web

Summary: The paper describes an unsupervised method for extracting relational tables from lists on the web, handling delimiters and missing fields. It uses a corpus of HTML tables to validate field splits and alignments, produces an extraction score, and scales to about 100k lists, implying that tens of millions of usable tables could be harvested. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 9874
Venue: VLDB
Year: 2009
Pagerank: 0.00012625853
Overall Rank: 1,317 | 90.84%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 21 of 21 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
518 | Data Integration for the Relational Web | 2009 | VLDB | 0.00021158934
818 | Finding Related Tables | 2012 | SIGMOD | 0.00016311524
1,001 | Recovering Semantics of Tables on the Web | 2011 | VLDB | 0.00014706505
1,469 | BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations | 2016 | VLDB | 0.00011836053
1,585 | Answering Table Augmentation Queries from Unstructured Lists on the Web | 2009 | VLDB | 0.00011255098
1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871
2,587 | Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks | 2024 | SIGMOD | 8.4924618e-05
2,617 | Extraction and Integration of Partially Overlapping Web Sources | 2013 | VLDB | 8.4462621e-05
3,155 | Ten Years of WebTables | 2018 | VLDB | 7.4672742e-05
3,252 | Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks | 2020 | SIGMOD | 7.3178277e-05
3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05
3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05
3,742 | TEGRA: Table Extraction by Global Record Alignment | 2015 | SIGMOD | 6.7966898e-05
3,963 | Pytheas: Pattern-based Table Discovery in CSV Files | 2020 | VLDB | 6.5840643e-05
5,652 | From Information to Knowledge: Harvesting Entities and Relationships from Web Sources | 2010 | PODS | 5.3903671e-05
6,992 | An Efficient Publish/Subscribe Index for E-Commerce Databases | 2014 | VLDB | 4.8701339e-05
7,588 | Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases | 2013 | VLDB | 4.7030914e-05
8,307 | Automatic Web-Scale Information Extraction | 2012 | SIGMOD | 4.5435639e-05
9,248 | Web Record Extraction with Invariants | 2023 | VLDB | 4.3690661e-05
9,253 | Glean: Structured Extractions from Templatic Documents | 2021 | VLDB | 4.3690661e-05
12,052 | Provenance-based Dictionary Refinement in Information Extraction | 2013 | SIGMOD | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 7 of 7 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
107 | WebTables: Exploring the Power of Tables on the Web | 2008 | VLDB | 0.00048377684
533 | RoadRunner: Towards Automatic Data Extraction from Large Web Sites | 2001 | VLDB | 0.00020757722
587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348
637 | Automatic segmentation of text into structured records | 2001 | SIGMOD | 0.00018824614
1,537 | Google's Deep-Web Crawl | 2008 | VLDB | 0.00011465704
2,005 | Record-Boundary Discovery in Web Documents | 1999 | SIGMOD | 9.8112591e-05
3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05

Semantically Similar Papers