Database Paper Browser

Automatic segmentation of text into structured records

Summary: Presents DATAMOLD, a tool that automatically segments unformatted text into structured records after learning the structure from a small seed set of examples. It extends hidden Markov models (HMMs) with cues from multiple sources (element sequence, length, vocabulary, and an external dictionary) for robust address extraction. Reported accuracy is 90% on Asian addresses and 99% on US addresses, higher than rule-based information extraction. (summarized by gpt-5-nano on Feb 09 2026)
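
To make the HMM idea in the summary concrete, here is a minimal sketch of segmenting an address with a plain HMM and Viterbi decoding. It is not the paper's model: the state names, all probabilities, the `symbol` helper, and the tiny `CITY_DICTIONARY` are hypothetical values chosen for illustration, and the multi-source extensions the summary mentions are only loosely imitated by the coarse token symbols.

```python
import math

# Hypothetical label set for a US-style address (an assumption, not from the paper).
STATES = ["HOUSE_NO", "STREET", "CITY", "ZIP"]

# Hypothetical stand-in for an external dictionary of city names.
CITY_DICTIONARY = {"springfield", "portland"}


def symbol(token):
    """Map a raw token to a coarse symbol (a crude vocabulary/length cue)."""
    if token.isdigit():
        return "ZIP5" if len(token) == 5 else "NUM"
    if token.lower() in CITY_DICTIONARY:
        return "CITYWORD"
    return "WORD"


# Illustrative, hand-picked probabilities; a real system would learn them
# from labeled seed examples.
START = {"HOUSE_NO": 0.7, "STREET": 0.2, "CITY": 0.05, "ZIP": 0.05}
TRANS = {
    "HOUSE_NO": {"HOUSE_NO": 0.05, "STREET": 0.85, "CITY": 0.05, "ZIP": 0.05},
    "STREET": {"HOUSE_NO": 0.05, "STREET": 0.5, "CITY": 0.4, "ZIP": 0.05},
    "CITY": {"HOUSE_NO": 0.05, "STREET": 0.05, "CITY": 0.4, "ZIP": 0.5},
    "ZIP": {"HOUSE_NO": 0.05, "STREET": 0.05, "CITY": 0.05, "ZIP": 0.85},
}
EMIT = {
    "HOUSE_NO": {"NUM": 0.85, "ZIP5": 0.05, "CITYWORD": 0.05, "WORD": 0.05},
    "STREET": {"NUM": 0.1, "ZIP5": 0.02, "CITYWORD": 0.08, "WORD": 0.8},
    "CITY": {"NUM": 0.02, "ZIP5": 0.02, "CITYWORD": 0.76, "WORD": 0.2},
    "ZIP": {"NUM": 0.1, "ZIP5": 0.8, "CITYWORD": 0.05, "WORD": 0.05},
}


def segment(tokens):
    """Return (token, label) pairs for the most likely state path (Viterbi)."""
    if not tokens:
        return []
    symbols = [symbol(t) for t in tokens]
    # Log-probability of the best path ending in each state after the first token.
    score = {s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}
    back = []  # one back-pointer table per later token
    for sym in symbols[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][sym])
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))


if __name__ == "__main__":
    for token, label in segment("18 Elm Street Springfield 62704".split()):
        print(f"{label}: {token}")
```

Consecutive tokens with the same label form one field of the structured record, so "Elm Street" would come out as a single street element under these toy parameters.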

Paper ID: 3267
Venue: SIGMOD
Year: 2001
Pagerank: 0.00018824614
Overall Rank: 637 | 95.58%
DOI: -

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 16 of 16 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036
280 | Eliminating Fuzzy Duplicates in Data Warehouses | 2002 | VLDB | 0.00029113044
760 | Creating Probabilistic Databases from Information Extraction Models | 2006 | VLDB | 0.00017053935
1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 0.00012625853
1,533 | Example-driven Design of Efficient Record Matching Queries | 2007 | VLDB | 0.00011471971
1,762 | Tuning Schema Matching Software using Synthetic Scenarios | 2005 | VLDB | 0.00010646894
3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05
3,529 | Merging the Results of Approximate Match Operations | 2004 | VLDB | 7.0059524e-05
3,742 | TEGRA: Table Extraction by Global Record Alignment | 2015 | SIGMOD | 6.7966898e-05
4,137 | Exploiting Content Redundancy for Web Information Extraction | 2010 | VLDB | 6.4181549e-05
5,399 | Joint Unsupervised Structure Discovery and Information Extraction | 2011 | SIGMOD | 5.5291067e-05
5,431 | Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach | 2013 | VLDB | 5.5076946e-05
7,397 | A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces | 2011 | VLDB | 4.7417648e-05
8,007 | A Grammar-based Entity Representation Framework for Data Cleaning | 2009 | SIGMOD | 4.6068018e-05
9,423 | Database Principles in Information Extraction | 2014 | PODS | 4.3441378e-05
12,230 | ONDUX: On-Demand Unsupervised Learning for Information Extraction | 2010 | SIGMOD | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 4 of 4 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.


Semantically Similar Papers