Database Paper Browser

Record-Boundary Discovery in Web Documents

Summary: A tree-based, heuristic approach to discovering record boundaries in Web documents that contain multiple records. It locates the subtree that holds the records, then derives a consensus separator tag by combining five independent heuristics. The approach is reported to run in linear time in practice and to reach 100% accuracy in the paper's experiments, which makes it a robust basis for record-level extraction. (summarized by gpt-5-nano on Feb 09 2026)
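The two steps named in the summary (find the record subtree, then agree on a separator tag) can be illustrated with a small Python sketch. Everything below is an assumption made for illustration: the summary does not name the five heuristics, so the sketch uses only two stand-in heuristics (`by_count` and `by_regularity`), picks the record subtree as the node with the most children, and merges per-rank certainty values from a hypothetical `RANK_CERTAINTY` table with the rule `a + b - a*b`. It should not be read as the paper's actual method.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import pvariance

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
RANK_CERTAINTY = [0.8, 0.5, 0.2]  # hypothetical certainty for ranks 1, 2, 3


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree; text content is ignored in this sketch."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void tags never receive children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent


def highest_fanout(root):
    """Assumed stand-in for 'locate the record subtree': most direct children."""
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if len(node.children) > len(best.children):
            best = node
        stack.extend(node.children)
    return best


def by_count(tags):
    """Stand-in heuristic 1: rank tags by how often they occur."""
    return [tag for tag, _ in Counter(tags).most_common()]


def by_regularity(tags):
    """Stand-in heuristic 2: rank tags by how evenly spaced they are."""
    spread = {}
    for tag in dict.fromkeys(tags):
        pos = [i for i, t in enumerate(tags) if t == tag]
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        if len(gaps) >= 2:  # need at least two gaps to measure spread
            spread[tag] = pvariance(gaps)
    return sorted(spread, key=spread.get)


def consensus_separator(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    tags = [child.tag for child in highest_fanout(builder.root).children]
    combined = {}
    for heuristic in (by_count, by_regularity):
        for tag, cf in zip(heuristic(tags), RANK_CERTAINTY):
            prev = combined.get(tag, 0.0)
            combined[tag] = prev + cf - prev * cf  # merge two certainties
    return max(combined, key=combined.get), combined


SAMPLE = (
    "<html><body><h1>Listings</h1><div>"
    "<hr><b>Ann</b> text<br>"
    "<hr><b>Bob</b> text"
    "<hr><b>Cy</b> text<br>"
    "<hr><i>Dee</i> text"
    "</div></body></html>"
)
```

Calling `consensus_separator(SAMPLE)` on this made-up listing selects `hr`, because both stand-in heuristics rank it first. A fuller version would add more heuristics and tune the certainty values empirically.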

Paper ID: 3122
Venue: SIGMOD
Year: 1999
Pagerank: 9.8112591e-05
Overall Rank: 2,005 | 86.06%
DOI: -

[Chart: Incoming Non-self Citations Over Time]

Authors

Incoming Citations (Sorted by Pagerank)

Showing 7 of 7 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
637 | Automatic segmentation of text into structured records | 2001 | SIGMOD | 0.00018824614
1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853
3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05
4,707 | Object-level Vertical Search | 2007 | CIDR | 5.9810753e-05
9,248 | Web Record Extraction with Invariants | 2023 | VLDB | 4.3690661e-05
12,525 | Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages | 2006 | VLDB | 4.1945683e-05
12,691 | Toward Learning Based Web Query Processing | 2000 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 2 of 2 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
385 | NoDoSE - A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents. | 1998 | SIGMOD | 0.00024795739
1,919 | Cut and Paste | 1997 | PODS | 0.00010094755
