Web Record Extraction with Invariants
Summary: Introduce invariants for Web data records to relax prior assumptions of uniform schemas and linear layouts, enabling discovery of heterogeneous, nested records on modern pages. Propose Miria, a bottom-up recursive algorithm that reconstructs records from invariants and outperforms prior extractors. (summarized by gpt-5-mini on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
- 1. Zhijia Chen
- 2. Weiyi Meng
- 3. Eduard Dragut
Incoming Citations (Sorted by Pagerank)
Showing 1 of 1 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,126 | Visual Template Inference for Data Extraction from Documents | 2026 | SIGMOD | 4.1945683e-05 |
Previous
Page 1 / 1
Next
Outgoing Citations (Sorted by Pagerank)
Showing 12 of 12 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Previous
Page 1 / 1
Next
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,395 | Structured Querying of Web Text: A Technical Challenge | 2007 | CIDR | 0.00012207039 |
| 1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871 |
| 2,633 | Schema Extraction for Tabular Data on the Web | 2013 | VLDB | 8.4063569e-05 |
| 2,005 | Record-Boundary Discovery in Web Documents | 1999 | SIGMOD | 9.8112591e-05 |
| 5,774 | A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration | 2009 | VLDB | 5.3313642e-05 |
| 12,525 | Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages | 2006 | VLDB | 4.1945683e-05 |
| 4,137 | Exploiting Content Redundancy for Web Information Extraction | 2010 | VLDB | 6.4181549e-05 |
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |
| 2,617 | Extraction and Integration of Partially Overlapping Web Sources | 2013 | VLDB | 8.4462621e-05 |