Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages
Summary: LEAST is a self-training method that synthesizes weakly labeled fine-tuning corpora for information extraction (IE) from semi-structured web pages using minimal human annotation. It uses uncertainty-aware training to mitigate noisy labels, generalizes across backbone models and verticals, and reduces the human labels required by up to 11x (fewer than 10 pages per site). (summarized by gpt-5-mini on Feb 09 2026)
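The general idea of self-training with confidence-weighted pseudo-labels, as mentioned in the summary, can be illustrated with a minimal sketch. This is not the LEAST algorithm: the nearest-centroid classifier, the function names (`fit_centroids`, `predict_proba`, `self_train`), the `min_conf` threshold, and the toy feature vectors are all assumptions chosen for illustration. It only shows the generic loop: train on a few labeled examples, pseudo-label unlabeled ones, and down-weight uncertain pseudo-labels when retraining.

```python
import math
from collections import defaultdict

def fit_centroids(examples):
    """examples: list of (features, label, weight). Returns {label: weighted mean}."""
    sums, totals = {}, defaultdict(float)
    for x, y, w in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += w * v
        totals[y] += w
    return {y: [s / totals[y] for s in acc] for y, acc in sums.items()}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = {y: -sum((a - b) ** 2 for a, b in zip(x, c))
              for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def self_train(labeled, unlabeled, rounds=3, min_conf=0.6):
    """Hypothetical self-training loop (illustrative, not the paper's method)."""
    seed = [(x, y, 1.0) for x, y in labeled]  # human labels get full weight
    centroids = fit_centroids(seed)
    for _ in range(rounds):
        pseudo = []
        for x in unlabeled:
            probs = predict_proba(centroids, x)
            label = max(probs, key=probs.get)
            conf = probs[label]
            if conf >= min_conf:
                # Weight each pseudo-label by model confidence so that
                # uncertain (likely noisy) labels influence training less.
                pseudo.append((x, label, conf))
        centroids = fit_centroids(seed + pseudo)
    return centroids

# Toy data: two labeled nodes, three unlabeled ones (made-up features).
labeled = [([0.0, 0.0], "name"), ([4.0, 4.0], "price")]
unlabeled = [[0.5, 0.2], [3.8, 4.1], [2.0, 2.0]]
model = self_train(labeled, unlabeled)
```

Weighting by confidence is one simple way to limit the effect of noisy pseudo-labels; the paper's uncertainty-aware training may differ substantially from this stand-in.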
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Authors
1. Ritesh Sarkhel
2. Binxuan Huang
3. Colin Lockard
4. Prashant Shiralkar
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 6 of 6 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,617 | Extraction and Integration of Partially Overlapping Web Sources | 2013 | VLDB | 8.4462621e-05 |
| 3,303 | Fonduer: Knowledge Base Construction from Richly Formatted Data | 2018 | SIGMOD | 7.2487486e-05 |
| 3,574 | KBQA: Learning Question Answering over QA Corpora and Knowledge Bases | 2017 | VLDB | 6.9533902e-05 |
| 6,412 | CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web | 2018 | VLDB | 5.0740036e-05 |
| 8,461 | Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents | 2019 | SIGMOD | 4.5061205e-05 |
| 9,252 | Improving Information Extraction from Visually Rich Documents using Visual Span Representations | 2021 | VLDB | 4.3690661e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,973 | Unstructured Data Fusion for Schema and Data Extraction | 2024 | SIGMOD | 4.1945683e-05 |
| 10,316 | LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning | 2026 | VLDB | 4.1945683e-05 |
| 12,691 | Toward Learning Based Web Query Processing | 2000 | VLDB | 4.1945683e-05 |
| 7,280 | I4E: Interactive Investigation of Iterative Information Extraction | 2010 | SIGMOD | 4.778826e-05 |
| 11,775 | Building Structured Databases of Factual Knowledge from Massive Text Corpora | 2017 | SIGMOD | 4.1945683e-05 |
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348 |
| 4,092 | Structured Annotations of Web Queries | 2010 | SIGMOD | 6.4561959e-05 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |
| 11,844 | Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale | 2016 | SIGMOD | 4.1945683e-05 |
| 7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05 |