Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design

Summary: A case study of migrating Gmail's privacy-safe, production rule-based information extraction (IE) system to a Software 2.0 design. The outputs of the existing rules serve as training labels for ML extractors, which improve precision and recall, shrink the codebase, and enable cross-language extraction. The paper also discusses challenges in generating and managing training data and in evaluating models, as well as the Software 1.0 infrastructure needed to deploy ML extractors safely. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 368
Venue: CIDR
Year: 2020
Pagerank: 5.1725247e-05
Overall Rank: 11,547 | 19.75%
DOI: -

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Authors

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.


Outgoing Citations (Sorted by Pagerank)

Showing 10 of 10 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
45	The Case for Learned Index Structures	2018	SIGMOD	0.0004530684
63	Freebase: A Collaboratively Created Graph Database For Structuring Human Knowledge	2008	SIGMOD	0.00039278196
117	HoloClean: Holistic Data Repairs with Probabilistic Inference	2017	VLDB	0.00032346273
576	Extracting Structured Data from Web Pages	2003	SIGMOD	0.00016335144
3,290	Automatic Wrappers for Large Scale Web Extraction	2011	VLDB	7.6161829e-05
3,549	The Role of Massively Multi-Task and Weak Supervision in Software 2.0	2019	CIDR	7.3816787e-05
3,658	Snorkel: Fast Training Set Generation for Information Extraction	2017	SIGMOD	7.2831799e-05
5,674	DIADEM: Thousands of Websites to a Single Database	2014	VLDB	6.1908158e-05
6,237	CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web	2018	VLDB	6.0066515e-05
11,678	Online Template Induction for Machine-Generated Emails	2019	VLDB	5.1725247e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
11,633	Leveraging Organizational Resources to Adapt Models to New Data Modalities	2020	VLDB	5.1725247e-05
2,584	Brainwash: A Data System for Feature Engineering	2013	CIDR	8.4544411e-05
6,238	Data Collection and Quality Challenges for Deep Learning	2020	VLDB	6.0064915e-05
9,268	Glean: Structured Extractions from Templatic Documents	2021	VLDB	5.3572577e-05
265	On the Design and Quantification of Privacy Preserving Data Mining Algorithms	2001	PODS	0.00023055086
6,315	Automatic Rule Refinement for Information Extraction	2010	VLDB	5.9760631e-05
9,413	Retrofitting GDPR Compliance onto Legacy Databases	2022	VLDB	5.3341661e-05
12,419	Mining Patterns and Rules for Software Specification Discovery	2008	VLDB	5.1725247e-05
12,625	Privacy in Data Systems	2003	PODS	5.1725247e-05
3,549	The Role of Massively Multi-Task and Weak Supervision in Software 2.0	2019	CIDR	7.3816787e-05