Extracting Logical Hierarchical Structure of HTML Documents Based on Headings
Summary: The paper proposes a heading-driven method for extracting the logical hierarchy of HTML documents, addressing the mismatch between markup structure and semantic structure. It uses heading position, visual prominence, and level-based styling to derive hierarchical blocks together with their associated headings, and it outperforms prior methods. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Tomohiro Manabe
2. Keishi Tajima
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,461 | Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents | 2019 | SIGMOD | 4.5061205e-05 |
| 9,252 | Improving Information Extraction from Visually Rich Documents using Visual Span Representations | 2021 | VLDB | 4.3690661e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 4 of 4 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348 |
| 2,224 | The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents | 2005 | VLDB | 9.251962e-05 |
| 2,633 | Schema Extraction for Tabular Data on the Web | 2013 | VLDB | 8.4063569e-05 |
| 5,399 | Joint Unsupervised Structure Discovery and Information Extraction | 2011 | SIGMOD | 5.5291067e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853 |
| 12,545 | A Framework for Processing Complex Document-centric XML with Overlapping Structures | 2005 | SIGMOD | 4.1945683e-05 |
| 13,808 | A Method of Re-ranking Web Search Results Using their Hidden Hyperlink Structure | 2002 | VLDB | - |
| 4,440 | Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model | 2009 | SIGMOD | 6.187819e-05 |
| 7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05 |
| 1,851 | An Analysis of Structured Data on the Web | 2012 | VLDB | 0.00010327871 |
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |
| 2,005 | Record-Boundary Discovery in Web Documents | 1999 | SIGMOD | 9.8112591e-05 |
| 5,774 | A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration | 2009 | VLDB | 5.3313642e-05 |