Web Page Language Identification Based on URLs
Summary: A URL-only classifier identifies a web page's language without fetching its content. Across five languages (en, fr, de, es, it), it reaches F-measures up to 0.96 and recall up to 0.95, outperforming ccTLD baselines and enabling quota-aware crawling. (summarized by gpt-5-nano on Feb 09 2026)
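To make the ccTLD baseline mentioned in the summary concrete, here is a minimal sketch that guesses a language from the country-code top-level domain alone. It is an illustration, not the paper's implementation: the function name `cctld_language` and the `CCTLD_TO_LANG` map are assumptions, and the paper may define its baseline differently.

```python
from urllib.parse import urlparse

# Hypothetical ccTLD-to-language map covering the five languages in the
# summary. The exact mapping used by the paper's baseline is not given here.
CCTLD_TO_LANG = {
    "uk": "en",
    "fr": "fr",
    "de": "de",
    "es": "es",
    "it": "it",
}


def cctld_language(url):
    """Guess a page's language from the URL's top-level domain alone.

    Returns a language code, or None when the URL has no host or the TLD
    is not in the map (for example generic TLDs such as .com or .org).
    """
    host = urlparse(url).hostname  # lowercased by urlparse; None if absent
    if not host:
        return None
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return CCTLD_TO_LANG.get(tld)
```

A baseline like this cannot label URLs under generic domains such as `.com`, which is one reason a classifier that uses more of the URL can do better.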
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Authors
- 1. Eda Baykan
- 2. Monika Henzinger
- 3. Ingmar Weber
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 0 of 0 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,878 | Learning to Extract Form Labels | 2008 | VLDB | 4.4302126e-05 |
| 11,844 | Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale | 2016 | SIGMOD | 4.1945683e-05 |
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 1.9648348e-04 |
| 7,826 | The Smallest Extraction Problem | 2021 | VLDB | 4.6416742e-05 |
| 3,285 | Using the Structure of Web Sites for Automatic Segmentation of Tables | 2004 | SIGMOD | 7.2759001e-05 |
| 12,590 | An Automatic Data Grabber for Large Web Sites | 2004 | VLDB | 4.1945683e-05 |
| 2,600 | Multilingual Schema Matching for Wikipedia Infoboxes | 2012 | VLDB | 8.4694459e-05 |
| 12,691 | Toward Learning Based Web Query Processing | 2000 | VLDB | 4.1945683e-05 |
| 7,768 | Accurate and Efficient Crawling for Relevant Websites | 2004 | VLDB | 4.6563056e-05 |
| 11,256 | Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages | 2023 | VLDB | 4.1945683e-05 |