Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model
Summary: Proposes a probabilistic tree-edit model of how HTML pages evolve, learned from web snapshots, whose likelihood can be computed in quadratic time. A wrapper framework uses the model to build robust wrappers, which outperform traditional ones on synthetic and real data. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Nilesh Dalvi
2. Philip Bohannon
3. Fei Sha
Incoming Citations (Sorted by Pagerank)
Showing 11 of 11 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,221 | A Web of Concepts | 2009 | PODS | 0.00013219242 |
| 3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05 |
| 3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05 |
| 6,133 | DIADEM: Thousands of Websites to a Single Database | 2014 | VLDB | 5.1954702e-05 |
| 6,751 | Optimal Schemes for Robust Web Extraction | 2011 | VLDB | 4.939042e-05 |
| 7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05 |
| 7,919 | DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web | 2015 | VLDB | 4.616746e-05 |
| 9,026 | Robust and Noise Resistant Wrapper Induction | 2016 | SIGMOD | 4.4051668e-05 |
| 10,126 | Visual Template Inference for Data Extraction from Documents | 2026 | SIGMOD | 4.1945683e-05 |
| 11,706 | Big Data Linkage for Product Specification Pages | 2018 | SIGMOD | 4.1945683e-05 |
| 12,280 | Building Ranked Mashups of Unstructured Sources with Uncertain Information | 2010 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 6 of 6 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 533 | RoadRunner: Towards Automatic Data Extraction from Large Web Sites | 2001 | VLDB | 0.00020757722 |
| 1,132 | Building light-weight wrappers for legacy Web data-sources using W4F | 1999 | VLDB | 0.00013777657 |
| 2,698 | Visual Web Information Extraction with Lixto | 2001 | VLDB | 8.2753317e-05 |
| 5,174 | Mapping Maintenance for Data Integration Systems | 2005 | VLDB | 5.6443463e-05 |
| 5,609 | Documentum ECI Self-Repairing Wrappers: Performance Analysis | 2006 | SIGMOD | 5.4129892e-05 |
| 6,118 | myPortal: Robust Extraction and Aggregation of Web Content | 2006 | VLDB | 5.2023907e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 533 | RoadRunner: Towards Automatic Data Extraction from Large Web Sites | 2001 | VLDB | 0.00020757722 |
| 8,322 | An XML-based Wrapper Generator for Web Information Extraction | 1999 | SIGMOD | 4.5435639e-05 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |
| 5,774 | A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration | 2009 | VLDB | 5.3313642e-05 |
| 2,005 | Record-Boundary Discovery in Web Documents | 1999 | SIGMOD | 9.8112591e-05 |
| 12,590 | An Automatic Data Grabber for Large Web Sites | 2004 | VLDB | 4.1945683e-05 |
| 6,958 | Computational Aspects of Resilient Data Extraction from Semistructured Sources | 2000 | PODS | 4.8857878e-05 |
| 3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05 |
| 9,026 | Robust and Noise Resistant Wrapper Induction | 2016 | SIGMOD | 4.4051668e-05 |
| 6,751 | Optimal Schemes for Robust Web Extraction | 2011 | VLDB | 4.939042e-05 |