Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale
Summary: An end-to-end, web-scale, domain-specific information extraction (IE) system built on a Stratosphere-based pipeline that combines focused crawling, HTML repair, and multi-stage NLP/NER. The results are compared with Medline and full-text corpora to gauge scalability and quality, demonstrating real-world, domain-specific IE at web scale. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Authors
1. Astrid Rheinländer
2. Mario Lehmann
3. Anja Kunkel
4. Jörg Meier
5. Ulf Leser
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 1 of 1 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 287 | Declarative Information Extraction Using Datalog with Embedded Extraction Predicates | 2007 | VLDB | 0.00028971272 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 12,044 | Knowledge Harvesting in the Big-Data Era | 2013 | SIGMOD | 4.1945683e-05 |
| 3,931 | Extracting and Querying a Comprehensive Web Database | 2009 | CIDR | 6.6193836e-05 |
| 420 | InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables | 2012 | SIGMOD | 0.00023719065 |
| 9,136 | TextCube: Automated Construction and Multidimensional Exploration | 2019 | VLDB | 4.3881065e-05 |
| 13,626 | Managing Information Extraction [Tutorial Outline] | 2006 | SIGMOD | - |
| 5,379 | Scalable Ad-hoc Entity Extraction from Text Collections | 2008 | VLDB | 5.5405989e-05 |
| 11,256 | Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages | 2023 | VLDB | 4.1945683e-05 |
| 7,280 | I4E: Interactive Investigation of Iterative Information Extraction | 2010 | SIGMOD | 4.778826e-05 |
| 1,395 | Structured Querying of Web Text: A Technical Challenge | 2007 | CIDR | 0.00012207039 |
| 11,775 | Building Structured Databases of Factual Knowledge from Massive Text Corpora | 2017 | SIGMOD | 4.1945683e-05 |