Database Paper Browser

Snuba: Automating Weak Supervision to Label Training Data

Summary: Snuba automates weak supervision by generating task-specific labeling heuristics from a small labeled dataset and using them to label a large unlabeled corpus. It adds heuristics iteratively to grow coverage, and a statistical measure stops the process before training-data quality degrades. Snuba generates heuristics in under five minutes and outperforms user-defined heuristics by up to 9.74 F1 points and semi-supervised baselines by up to 14.35 F1 points. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11941
Venue: VLDB
Year: 2019
Pagerank: 0.0001323375
Overall Rank: 1,215 | 91.55%
DOI: 10.14778/3291264.3291268

Incoming Citations (Sorted by Pagerank)

Showing 26 of 26 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154
2,825 | Smile: A System to Support Machine Learning on EEG Data at Scale | 2019 | VLDB | 8.0563426e-05
2,958 | The Role of Massively Multi-Task and Weak Supervision in Software 2.0 | 2019 | CIDR | 7.8173975e-05
4,471 | GOGGLES: Automatic Image Labeling with Affinity Coding | 2020 | SIGMOD | 6.1555681e-05
4,872 | Explainable AI: Foundations, Applications, Opportunities for Data Management Research | 2022 | SIGMOD | 5.8609352e-05
5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05
5,347 | Adaptive Rule Discovery for Labeling Text Data | 2021 | SIGMOD | 5.5560452e-05
5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05
6,130 | VOCAL: Video Organization and Interactive Compositional AnaLytics | 2022 | CIDR | 5.1962107e-05
6,955 | Inspector Gadget: A Data Programming-based Labeling System for Industrial Images | 2021 | VLDB | 4.8864297e-05
7,288 | Witan: Unsupervised Labelling Function Generation for Assisted Data Programming | 2022 | VLDB | 4.7762276e-05
7,796 | CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties | 2021 | VLDB | 4.6482625e-05
8,292 | Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming | 2022 | VLDB | 4.5435639e-05
8,343 | CrowdGame: A Game-Based Crowdsourcing System for Cost-Effective Data Labeling | 2019 | SIGMOD | 4.5429217e-05
8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05
8,590 | Exploratory Training: When Annotators Learn About Data | 2023 | SIGMOD | 4.4896282e-05
8,714 | LANCET: Labeling Complex Data at Scale | 2021 | VLDB | 4.4619818e-05
9,409 | Ground Truth Inference for Weakly Supervised Entity Matching | 2023 | SIGMOD | 4.3441378e-05
9,777 | Data Augmentation for ML-driven Data Preparation and Integration | 2021 | VLDB | 4.2856106e-05
9,806 | The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format | 2024 | SIGMOD | 4.2805224e-05
9,873 | CORAL: Collaborative Automatic Labeling System based on Large Language Models | 2024 | VLDB | 4.2667743e-05
10,291 | Morphing-based Compression for Data-centric ML Pipelines | 2026 | VLDB | 4.1945683e-05
10,465 | A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces | 2025 | SIGMOD | 4.1945683e-05
10,533 | WeShap: Weak Supervision Source Evaluation with Shapley Values | 2025 | VLDB | 4.1945683e-05
11,205 | Steered Training Data Generation for Learned Semantic Type Detection | 2023 | SIGMOD | 4.1945683e-05
11,629 | Leveraging Organizational Resources to Adapt Models to New Data Modalities | 2020 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 5 of 5 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555
398 | Big Data Integration | 2013 | VLDB | 0.00024372588
908 | Fusing Data with Correlations | 2014 | SIGMOD | 0.00015431241
3,303 | Fonduer: Knowledge Base Construction from Richly Formatted Data | 2018 | SIGMOD | 7.2487486e-05
3,897 | SLiMFast: Guaranteed Results for Data Fusion and Source Reliability | 2017 | SIGMOD | 6.6554845e-05

Semantically Similar Papers