Snorkel: Rapid Training Data Creation with Weak Supervision

Summary: Snorkel lets users train ML models from weak supervision instead of hand-labeled data, by writing labeling functions whose accuracies are unknown. Its end-to-end data programming pipeline denoises the labels these functions produce without access to ground truth, and an optimizer automates the modeling tradeoff decisions; the paper reports speedups and accuracy gains over hand labeling. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11741
Venue: VLDB
Year: 2018
Pagerank: 0.00030540555
Overall Rank: 254 | 98.24%
DOI: 10.14778/3157794.3157797

Incoming Citations (Sorted by Pagerank)

Showing 50 of 70 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
300	Deep Learning for Entity Matching: A Design Space Exploration	2018	SIGMOD	0.00028441466
1,116	Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes	2024	VLDB	0.00013890154
1,215	Snuba: Automating Weak Supervision to Label Training Data	2019	VLDB	0.0001323375
1,337	HoloDetect: Few-Shot Learning for Error Detection	2019	SIGMOD	0.00012497164
1,666	HELIX: Holistic Optimization for Accelerating Iterative Machine Learning	2019	VLDB	0.0001096361
1,940	SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging	2021	SIGMOD	0.00010020173
1,993	Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning	2020	SIGMOD	9.8453334e-05
2,321	DBPal: A Fully Pluggable NL2SQL Training Pipeline	2020	SIGMOD	9.03609e-05
2,825	Smile: A System to Support Machine Learning on EEG Data at Scale	2019	VLDB	8.0563426e-05
2,839	VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition	2021	VLDB	8.0378978e-05
2,958	The Role of Massively Multi-Task and Weak Supervision in Software 2.0	2019	CIDR	7.8173975e-05
3,303	Fonduer: Knowledge Base Construction from Richly Formatted Data	2018	SIGMOD	7.2487486e-05
3,508	SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines	2024	VLDB	7.0271496e-05
3,942	Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins	2022	VLDB	6.6114622e-05
4,196	Overton: A Data System for Monitoring and Improving Machine-Learned Products	2020	CIDR	6.3686231e-05
4,456	AutoOD: Automatic Outlier Detection	2023	SIGMOD	6.1704203e-05
4,471	GOGGLES: Automatic Image Labeling with Affinity Coding	2020	SIGMOD	6.1555681e-05
4,590	MB2: Decomposed Behavior Modeling for Self-Driving Database Management Systems	2021	SIGMOD	6.0620053e-05
4,607	Data Integration and Machine Learning: A Natural Synergy	2018	SIGMOD	6.0538827e-05
4,751	ODIN: Automated Drift Detection and Recovery in Video Analytics	2020	VLDB	5.9485403e-05
4,872	Explainable AI: Foundations, Applications, Opportunities for Data Management Research	2022	SIGMOD	5.8609352e-05
4,935	OmniFair: A Declarative System for Model-Agnostic Group Fairness in Machine Learning	2021	SIGMOD	5.8198727e-05
5,242	Towards Benchmarking Feature Type Inference for AutoML Platforms	2021	SIGMOD	5.6074743e-05
5,251	Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale	2019	SIGMOD	5.6029615e-05
5,347	Adaptive Rule Discovery for Labeling Text Data	2021	SIGMOD	5.5560452e-05
5,381	Selective Data Acquisition in the Wild for Model Charging	2022	VLDB	5.5399508e-05
5,412	Mining an "Anti-Knowledge Base" from Wikipedia Updates with Applications to Fact Checking and Beyond	2020	VLDB	5.5207515e-05
5,869	Demonstration of Panda: A Weakly Supervised Entity Matching System	2021	VLDB	5.2959029e-05
5,963	Automatic Data Acquisition for Deep Learning	2021	VLDB	5.2526794e-05
5,978	Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond	2021	SIGMOD	5.2453012e-05
6,042	MDedup: Duplicate Detection with Matching Dependencies	2020	VLDB	5.2405269e-05
6,130	VOCAL: Video Organization and Interactive Compositional AnaLytics	2022	CIDR	5.1962107e-05
6,134	Finding Label and Model Errors in Perception Data With Learned Observation Assertions	2022	SIGMOD	5.1943414e-05
6,228	Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems	2021	VLDB	5.1470042e-05
6,247	Optimizing In-memory Database Engine for AI-powered On-line Decision Augmentation Using Persistent Memory	2021	VLDB	5.1389201e-05
6,519	Expand your Training Limits! Generating Training Data for ML-based Data Management	2021	SIGMOD	5.0316686e-05
6,868	Cost-Effective Data Annotation using Game-Based Crowdsourcing	2019	VLDB	4.9010083e-05
7,243	Data Integration and Machine Learning: A Natural Synergy	2018	VLDB	4.7913666e-05
7,288	Witan: Unsupervised Labelling Function Generation for Assisted Data Programming	2022	VLDB	4.7762276e-05
7,643	Cross Modal Data Discovery over Structured and Unstructured Data Lakes	2023	VLDB	4.6901105e-05
7,656	Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets	2022	SIGMOD	4.6871575e-05
7,796	CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties	2021	VLDB	4.6482625e-05
8,055	iFlipper: Label Flipping for Individual Fairness	2023	SIGMOD	4.5947404e-05
8,182	SHiFT: An Efficient, Flexible Search Engine for Transfer Learning	2023	VLDB	4.5659133e-05
8,292	Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming	2022	VLDB	4.5435639e-05
8,343	CrowdGame: A Game-Based Crowdsourcing System for Cost-Effective Data Labeling	2019	SIGMOD	4.5429217e-05
8,514	UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads	2022	VLDB	4.4944285e-05
8,714	LANCET: Labeling Complex Data at Scale	2021	VLDB	4.4619818e-05
9,192	Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale	2022	VLDB	4.3765131e-05
9,252	Improving Information Extraction from Visually Rich Documents using Visual Span Representations	2021	VLDB	4.3690661e-05

Outgoing Citations (Sorted by Pagerank)

Showing 5 of 5 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
192	HoloClean: Holistic Data Repairs with Probabilistic Inference	2017	VLDB	0.00035728858
371	A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration	2012	VLDB	0.00025389696
398	Big Data Integration	2013	VLDB	0.00024372588
908	Fusing Data with Correlations	2014	SIGMOD	0.00015431241
3,897	SLiMFast: Guaranteed Results for Data Fusion and Source Reliability	2017	SIGMOD	6.6554845e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
8,590	Exploratory Training: When Annotators Learn About Data	2023	SIGMOD	4.4896282e-05
9,409	Ground Truth Inference for Weakly Supervised Entity Matching	2023	SIGMOD	4.3441378e-05
5,963	Automatic Data Acquisition for Deep Learning	2021	VLDB	5.2526794e-05
5,347	Adaptive Rule Discovery for Labeling Text Data	2021	SIGMOD	5.5560452e-05
8,292	Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming	2022	VLDB	4.5435639e-05
6,955	Inspector Gadget: A Data Programming-based Labeling System for Industrial Images	2021	VLDB	4.8864297e-05
2,958	The Role of Massively Multi-Task and Weak Supervision in Software 2.0	2019	CIDR	7.8173975e-05
5,251	Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale	2019	SIGMOD	5.6029615e-05
1,215	Snuba: Automating Weak Supervision to Label Training Data	2019	VLDB	0.0001323375
4,087	Snorkel: Fast Training Set Generation for Information Extraction	2017	SIGMOD	6.4607746e-05