Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design
Summary: A case study of converting Gmail's privacy-safe, production rule-based information extraction (IE) system to a Software 2.0 design. The outputs of the existing rules serve as training labels for ML extractors that improve precision and recall, shrink the codebase, and enable cross-language extraction. The paper discusses challenges in generating and managing training data and in evaluating models, as well as the Software 1.0 infrastructure needed to deploy ML extractors safely. (summarized by gpt-5-mini on Feb 09 2026)
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Authors
1. Ying Sheng
2. Nguyen Vo
3. James B. Wendt
4. Sandeep Tata
5. Marc Najork
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 10 of 10 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 62 | Freebase: A Collaboratively Created Graph Database For Structuring Human Knowledge | 2008 | SIGMOD | 0.0006429466 |
| 102 | The Case for Learned Index Structures | 2018 | SIGMOD | 0.00049545203 |
| 192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858 |
| 587 | Extracting Structured Data from Web Pages | 2003 | SIGMOD | 0.00019648348 |
| 2,958 | The Role of Massively Multi-Task and Weak Supervision in Software 2.0 | 2019 | CIDR | 7.8173975e-05 |
| 3,678 | Automatic Wrappers for Large Scale Web Extraction | 2011 | VLDB | 6.8517545e-05 |
| 4,087 | Snorkel: Fast Training Set Generation for Information Extraction | 2017 | SIGMOD | 6.4607746e-05 |
| 6,133 | DIADEM: Thousands of Websites to a Single Database | 2014 | VLDB | 5.1954702e-05 |
| 6,412 | CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web | 2018 | VLDB | 5.0740036e-05 |
| 11,673 | Online Template Induction for Machine-Generated Emails | 2019 | VLDB | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,915 | Brainwash: A Data System for Feature Engineering | 2013 | CIDR | 7.9078385e-05 |
| 11,629 | Leveraging Organizational Resources to Adapt Models to New Data Modalities | 2020 | VLDB | 4.1945683e-05 |
| 6,526 | Data Collection and Quality Challenges for Deep Learning | 2020 | VLDB | 5.0267429e-05 |
| 9,253 | Glean: Structured Extractions from Templatic Documents | 2021 | VLDB | 4.3690661e-05 |
| 147 | On the Design and Quantification of Privacy Preserving Data Mining Algorithms | 2001 | PODS | 0.00041235556 |
| 6,534 | Automatic Rule Refinement for Information Extraction | 2010 | VLDB | 5.0244622e-05 |
| 9,412 | Retrofitting GDPR Compliance onto Legacy Databases | 2022 | VLDB | 4.3441378e-05 |
| 12,410 | Mining Patterns and Rules for Software Specification Discovery | 2008 | VLDB | 4.1945683e-05 |
| 12,616 | Privacy in Data Systems | 2003 | PODS | 4.1945683e-05 |
| 2,958 | The Role of Massively Multi-Task and Weak Supervision in Software 2.0 | 2019 | CIDR | 7.8173975e-05 |