Snuba: Automating Weak Supervision to Label Training Data
Summary: Snuba automates weak supervision by generating task-specific labeling heuristics from a small labeled set and using them to label a large unlabeled corpus. It grows coverage iteratively, with a statistical measure that guarantees the process terminates before training-set quality degrades. It generates heuristics in under five minutes and outperforms handcrafted heuristics by up to 9.74 F1 points and semi-supervised baselines by up to 14.35 F1 points.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 11941
- Venue: VLDB
- Year: 2019
- Pagerank: 0.0001323375
- Overall Rank: 1,215 | 91.55%
- DOI: 10.14778/3291264.3291268
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 26 of 26 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154 |
| 2,825 | Smile: A System to Support Machine Learning on EEG Data at Scale | 2019 | VLDB | 8.0563426e-05 |
| 2,958 | The Role of Massively Multi-Task and Weak Supervision in Software 2.0 | 2019 | CIDR | 7.8173975e-05 |
| 4,471 | GOGGLES: Automatic Image Labeling with Affinity Coding | 2020 | SIGMOD | 6.1555681e-05 |
| 4,872 | Explainable AI: Foundations, Applications, Opportunities for Data Management Research | 2022 | SIGMOD | 5.8609352e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 5,347 | Adaptive Rule Discovery for Labeling Text Data | 2021 | SIGMOD | 5.5560452e-05 |
| 5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05 |
| 6,130 | VOCAL: Video Organization and Interactive Compositional AnaLytics | 2022 | CIDR | 5.1962107e-05 |
| 6,955 | Inspector Gadget: A Data Programming-based Labeling System for Industrial Images | 2021 | VLDB | 4.8864297e-05 |
| 7,288 | Witan: Unsupervised Labelling Function Generation for Assisted Data Programming | 2022 | VLDB | 4.7762276e-05 |
| 7,796 | CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties | 2021 | VLDB | 4.6482625e-05 |
| 8,292 | Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming | 2022 | VLDB | 4.5435639e-05 |
| 8,343 | CrowdGame: A Game-Based Crowdsourcing System for Cost-Effective Data Labeling | 2019 | SIGMOD | 4.5429217e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 8,590 | Exploratory Training: When Annotators Learn About Data | 2023 | SIGMOD | 4.4896282e-05 |
| 8,714 | LANCET: Labeling Complex Data at Scale | 2021 | VLDB | 4.4619818e-05 |
| 9,409 | Ground Truth Inference for Weakly Supervised Entity Matching | 2023 | SIGMOD | 4.3441378e-05 |
| 9,777 | Data Augmentation for ML-driven Data Preparation and Integration | 2021 | VLDB | 4.2856106e-05 |
| 9,806 | The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format | 2024 | SIGMOD | 4.2805224e-05 |
| 9,873 | CORAL: Collaborative Automatic Labeling System based on Large Language Models | 2024 | VLDB | 4.2667743e-05 |
| 10,291 | Morphing-based Compression for Data-centric ML Pipelines | 2026 | VLDB | 4.1945683e-05 |
| 10,465 | A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces | 2025 | SIGMOD | 4.1945683e-05 |
| 10,533 | WeShap: Weak Supervision Source Evaluation with Shapley Values | 2025 | VLDB | 4.1945683e-05 |
| 11,205 | Steered Training Data Generation for Learned Semantic Type Detection | 2023 | SIGMOD | 4.1945683e-05 |
| 11,629 | Leveraging Organizational Resources to Adapt Models to New Data Modalities | 2020 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 5 of 5 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers