Database Paper Browser

Snorkel: Rapid Training Data Creation with Weak Supervision

Summary: Snorkel is an end-to-end system for data programming that lets users train ML models without hand-labeled data by writing labeling functions whose accuracies are unknown. It denoises the labeling functions' outputs without ground truth and includes an optimizer that automates modeling tradeoff decisions. The paper reports speedups and accuracy gains over hand labeling. (summarized by gpt-5-nano on Feb 09 2026)
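
The idea of labeling functions in the summary above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy spam task, the three labeling functions, and the helper names (`apply_lfs`, `majority_vote`) are hypothetical and are not Snorkel's API. The combiner is a plain unweighted majority vote, used as a simple stand-in; Snorkel itself denoises votes with a learned model of labeling-function accuracies, which this sketch does not implement.

```python
from collections import Counter

# Label conventions assumed for this sketch.
ABSTAIN, NEG, POS = -1, 0, 1

# Hypothetical labeling functions: each is a noisy heuristic that
# either votes for a class or abstains.
def lf_contains_link(text):
    return POS if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return POS if "prize" in text.lower() else ABSTAIN

def lf_short_greeting(text):
    return NEG if len(text.split()) <= 3 else ABSTAIN

LFS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]

def apply_lfs(texts):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(t) for lf in LFS] for t in texts]

def majority_vote(votes):
    """Combine one row of votes by unweighted majority; abstain on ties or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "Claim your prize at http://example.com",
    "hi there",
    "See you at the meeting tomorrow at noon",
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
```

Majority vote treats every labeling function as equally reliable. The point of estimating accuracies without ground truth, as the summary describes, is to weight the more reliable functions more heavily than this baseline does.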

Paper ID: 11741
Venue: VLDB
Year: 2018
Pagerank: 0.00030540555
Overall Rank: 254 | 98.24%
DOI: 10.14778/3157794.3157797

Incoming Non-self Citations Over Time

Incoming Citations (Sorted by Pagerank)

Showing 50 of 70 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466
1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154
1,215 | Snuba: Automating Weak Supervision to Label Training Data | 2019 | VLDB | 0.0001323375
1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164
1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361
1,940 | SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging | 2021 | SIGMOD | 0.00010020173
1,993 | Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning | 2020 | SIGMOD | 9.8453334e-05
2,321 | DBPal: A Fully Pluggable NL2SQL Training Pipeline | 2020 | SIGMOD | 9.03609e-05
2,825 | Smile: A System to Support Machine Learning on EEG Data at Scale | 2019 | VLDB | 8.0563426e-05
2,839 | VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition | 2021 | VLDB | 8.0378978e-05
2,958 | The Role of Massively Multi-Task and Weak Supervision in Software 2.0 | 2019 | CIDR | 7.8173975e-05
3,303 | Fonduer: Knowledge Base Construction from Richly Formatted Data | 2018 | SIGMOD | 7.2487486e-05
3,508 | spade: Synthesizing Data Quality Assertions for Large Language Model Pipelines | 2024 | VLDB | 7.0271496e-05
3,942 | Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins | 2022 | VLDB | 6.6114622e-05
4,196 | Overton: A Data System for Monitoring and Improving Machine-Learned Products | 2020 | CIDR | 6.3686231e-05
4,456 | AutoOD: Automatic Outlier Detection | 2023 | SIGMOD | 6.1704203e-05
4,471 | GOGGLES: Automatic Image Labeling with Affinity Coding | 2020 | SIGMOD | 6.1555681e-05
4,590 | MB2: Decomposed Behavior Modeling for Self-Driving Database Management Systems | 2021 | SIGMOD | 6.0620053e-05
4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05
4,751 | ODIN: Automated Drift Detection and Recovery in Video Analytics | 2020 | VLDB | 5.9485403e-05
4,872 | Explainable AI: Foundations, Applications, Opportunities for Data Management Research | 2022 | SIGMOD | 5.8609352e-05
4,935 | OmniFair: A Declarative System for Model-Agnostic Group Fairness in Machine Learning | 2021 | SIGMOD | 5.8198727e-05
5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05
5,251 | Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale | 2019 | SIGMOD | 5.6029615e-05
5,347 | Adaptive Rule Discovery for Labeling Text Data | 2021 | SIGMOD | 5.5560452e-05
5,381 | Selective Data Acquisition in the Wild for Model Charging | 2022 | VLDB | 5.5399508e-05
5,412 | Mining an "Anti-Knowledge Base" from Wikipedia Updates with Applications to Fact Checking and Beyond | 2020 | VLDB | 5.5207515e-05
5,869 | Demonstration of Panda: A Weakly Supervised Entity Matching System | 2021 | VLDB | 5.2959029e-05
5,963 | Automatic Data Acquisition for Deep Learning | 2021 | VLDB | 5.2526794e-05
5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05
6,042 | MDedup: Duplicate Detection with Matching Dependencies | 2020 | VLDB | 5.2405269e-05
6,130 | VOCAL: Video Organization and Interactive Compositional AnaLytics | 2022 | CIDR | 5.1962107e-05
6,134 | Finding Label and Model Errors in Perception Data With Learned Observation Assertions | 2022 | SIGMOD | 5.1943414e-05
6,228 | Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems | 2021 | VLDB | 5.1470042e-05
6,247 | Optimizing In-memory Database Engine for AI-powered On-line Decision Augmentation Using Persistent Memory | 2021 | VLDB | 5.1389201e-05
6,519 | Expand your Training Limits! Generating Training Data for ML-based Data Management | 2021 | SIGMOD | 5.0316686e-05
6,868 | Cost-Effective Data Annotation using Game-Based Crowdsourcing | 2019 | VLDB | 4.9010083e-05
7,243 | Data Integration and Machine Learning: A Natural Synergy | 2018 | VLDB | 4.7913666e-05
7,288 | Witan: Unsupervised Labelling Function Generation for Assisted Data Programming | 2022 | VLDB | 4.7762276e-05
7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05
7,656 | Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets | 2022 | SIGMOD | 4.6871575e-05
7,796 | CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties | 2021 | VLDB | 4.6482625e-05
8,055 | iFlipper: Label Flipping for Individual Fairness | 2023 | SIGMOD | 4.5947404e-05
8,182 | SHiFT: An Efficient, Flexible Search Engine for Transfer Learning | 2023 | VLDB | 4.5659133e-05
8,292 | Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming | 2022 | VLDB | 4.5435639e-05
8,343 | CrowdGame: A Game-Based Crowdsourcing System for Cost-Effective Data Labeling | 2019 | SIGMOD | 4.5429217e-05
8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05
8,714 | LANCET: Labeling Complex Data at Scale | 2021 | VLDB | 4.4619818e-05
9,192 | Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale | 2022 | VLDB | 4.3765131e-05
9,252 | Improving Information Extraction from Visually Rich Documents using Visual Span Representations | 2021 | VLDB | 4.3690661e-05

Outgoing Citations (Sorted by Pagerank)

Showing 5 of 5 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858
371 | A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration | 2012 | VLDB | 0.00025389696
398 | Big Data Integration | 2013 | VLDB | 0.00024372588
908 | Fusing Data with Correlations | 2014 | SIGMOD | 0.00015431241
3,897 | SLiMFast: Guaranteed Results for Data Fusion and Source Reliability | 2017 | SIGMOD | 6.6554845e-05

Semantically Similar Papers