Database Paper Browser

Exploiting Content Redundancy for Web Information Extraction

Summary: Exploits content redundancy on template-based sites to bootstrap a seed database and extract matching attribute values from new pages. Presents a template-aware similarity metric and an Apriori-style search over fixed attribute positions to filter noise, with experiments on real web data. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 10106
Venue: VLDB
Year: 2010
Pagerank: 6.4181549e-05
Overall Rank: 4,137 | 71.23%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 6 of 6 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871
2,617 | Extraction and Integration of Partially Overlapping Web Sources | 2013 | VLDB | 8.4462621e-05
6,133 | DIADEM: Thousands of Websites to a Single Database | 2014 | VLDB | 5.1954702e-05
6,412 | CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web | 2018 | VLDB | 5.0740036e-05
9,026 | Robust and Noise Resistant Wrapper Induction | 2016 | SIGMOD | 4.4051668e-05
11,706 | Big Data Linkage for Product Specification Pages | 2018 | SIGMOD | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 6 of 6 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers