Pytheas: Pattern-based Table Discovery in CSV Files
Summary: Pytheas discovers tables in loosely structured CSV files by combining pattern-based line classification with column-value coherency. It achieves precision and recall above 95% (versus roughly 89% precision and 81% recall for the comparison approach), generalizes across countries, and provides a confidence measure for flagging potential errors. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 7,102 | Mondrian: Spreadsheet Layout Detection | 2022 | SIGMOD | 4.8307982e-05 |
| 7,807 | Pollock: A Data Loading Benchmark | 2023 | VLDB | 4.6457732e-05 |
| 8,503 | A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science | 2021 | VLDB | 4.496339e-05 |
| 11,420 | Detecting Layout Templates in Complex Multiregion Files | 2022 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 13 of 13 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,001 | Recovering Semantics of Tables on the Web | 2011 | VLDB | 1.4706505e-04 |
| 5,529 | Data-Driven Domain Discovery for Structured Datasets | 2020 | VLDB | 5.4566641e-05 |
| 818 | Finding Related Tables | 2012 | SIGMOD | 1.6311524e-04 |
| 7,424 | Table Extraction and Understanding for Scientific and Enterprise Applications | 2020 | VLDB | 4.7339251e-05 |
| 8,913 | Making Table Understanding Work in Practice | 2022 | CIDR | 4.427232e-05 |
| 1,317 | Harvesting Relational Tables from Lists on the Web | 2009 | VLDB | 1.2625853e-04 |
| 10,109 | Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations | 2026 | SIGMOD | 4.1945683e-05 |
| 2,633 | Schema Extraction for Tabular Data on the Web | 2013 | VLDB | 8.4063569e-05 |
| 11,348 | Pythia: Unsupervised Generation of Ambiguous Textual Claims from Relational Data | 2022 | SIGMOD | 4.1945683e-05 |
| 6,846 | A framework for annotating CSV-like data | 2016 | VLDB | 4.9092462e-05 |