Automating Large-Scale Data Quality Verification
Summary: Automates data quality verification with a declarative API that combines common quality constraints with user-defined validation code, enabling "unit tests" for data. Translates validation into aggregation queries on Apache Spark, supports incremental validation on growing datasets, and uses machine learning for constraint suggestion and anomaly detection.
(summarized by gpt-5-nano on Feb 09 2026)
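As a rough illustration of the idea in the summary, the sketch below declares constraints as a metric (an aggregation over the whole dataset) paired with an assertion on the aggregated value. It is a minimal sketch under stated assumptions, not the paper's API: the names `Constraint`, `completeness`, `distinct_ratio`, `satisfies`, and `run_checks` are hypothetical, and a plain Python list of dicts stands in for a Spark dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


@dataclass
class Constraint:
    """Hypothetical declarative constraint: an aggregate metric plus an assertion."""
    name: str
    metric: Callable[[List[Row]], float]   # aggregation over all rows
    assertion: Callable[[float], bool]     # predicate on the aggregated value


def completeness(column: str) -> Callable[[List[Row]], float]:
    """Fraction of rows in which `column` is not None."""
    def metric(rows: List[Row]) -> float:
        if not rows:
            return 1.0
        return sum(r.get(column) is not None for r in rows) / len(rows)
    return metric


def distinct_ratio(column: str) -> Callable[[List[Row]], float]:
    """Distinct non-null values divided by the number of non-null values."""
    def metric(rows: List[Row]) -> float:
        values = [r[column] for r in rows if r.get(column) is not None]
        if not values:
            return 1.0
        return len(set(values)) / len(values)
    return metric


def satisfies(predicate: Callable[[Row], bool]) -> Callable[[List[Row]], float]:
    """User-defined check: fraction of rows for which `predicate` holds."""
    def metric(rows: List[Row]) -> float:
        if not rows:
            return 1.0
        return sum(bool(predicate(r)) for r in rows) / len(rows)
    return metric


def run_checks(rows: List[Row], constraints: List[Constraint]) -> Dict[str, bool]:
    """Evaluate every constraint and report pass/fail by constraint name."""
    return {c.name: c.assertion(c.metric(rows)) for c in constraints}


# Illustrative data, invented for this example.
rows: List[Row] = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": None},
    {"id": 2, "price": 3.0},
    {"id": 4, "price": -1.0},
]

constraints = [
    Constraint("id is complete", completeness("id"), lambda v: v == 1.0),
    Constraint("id is unique", distinct_ratio("id"), lambda v: v == 1.0),
    Constraint("price mostly complete", completeness("price"), lambda v: v >= 0.7),
    Constraint(
        "price non-negative",
        satisfies(lambda r: r["price"] is None or r["price"] >= 0),
        lambda v: v == 1.0,
    ),
]

results = run_checks(rows, constraints)
for name, passed in results.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

Because each metric reduces the dataset to a single number, a system of this kind can in principle push the computation down to an engine such as Spark as aggregation queries; this sketch simply loops over the rows in memory.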
- Paper ID: 11660
- Venue: VLDB
- Year: 2018
- Pagerank: 0.00011725533
- Overall Rank: 1,482 | 89.70%
- DOI: 10.14778/3229863.3229867
[Chart: Incoming Non-self Citations Over Time]
Incoming Citations (Sorted by Pagerank)
Showing 27 of 27 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 1,940 | SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging | 2021 | SIGMOD | 0.00010020173 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,158 | Uni-Detect: A Unified Approach to Automated Error Detection in Tables | 2019 | SIGMOD | 9.4141354e-05 |
| 2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05 |
| 3,252 | Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks | 2020 | SIGMOD | 7.3178277e-05 |
| 3,491 | TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines | 2020 | SIGMOD | 7.0451276e-05 |
| 3,508 | spade: Synthesizing Data Quality Assertions for Large Language Model Pipelines | 2024 | VLDB | 7.0271496e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 5,928 | SchemaPile: A Large Collection of Relational Database Schemas | 2024 | SIGMOD | 5.2685946e-05 |
| 6,291 | Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines | 2021 | CIDR | 5.1269764e-05 |
| 6,993 | Unit Testing Data with Deequ | 2019 | SIGMOD | 4.8693227e-05 |
| 7,838 | Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes | 2021 | SIGMOD | 4.6377995e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,422 | Deducing Certain Fixes to Graphs | 2019 | VLDB | 4.5167705e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 8,853 | Complaint-Driven Training Data Debugging at Interactive Speeds | 2022 | SIGMOD | 4.4350727e-05 |
| 8,915 | DQDF: Data-Quality-Aware Dataframes | 2022 | VLDB | 4.427232e-05 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 9,231 | Modyn: Data-Centric Machine Learning Pipeline Orchestration | 2025 | SIGMOD | 4.3690661e-05 |
| 9,856 | In-Database Data Imputation | 2024 | SIGMOD | 4.269353e-05 |
| 10,291 | Morphing-based Compression for Data-centric ML Pipelines | 2026 | VLDB | 4.1945683e-05 |
| 10,628 | CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines | 2025 | VLDB | 4.1945683e-05 |
| 10,867 | T-Assess: An Efficient Data Quality Assessment System Tailored for Trajectory Data | 2025 | VLDB | 4.1945683e-05 |
| 11,052 | Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines | 2024 | VLDB | 4.1945683e-05 |
| 11,280 | CM-Explorer: Dissecting Data Ingestion Problems | 2023 | VLDB | 4.1945683e-05 |
| 11,317 | Data Management Opportunities for Foundation Models | 2022 | CIDR | 4.1945683e-05 |
| 13,300 | DEEM 2019: Workshop on Data Management for End-to-End Machine Learning | 2019 | SIGMOD | - |
Outgoing Citations (Sorted by Pagerank)
Showing 16 of 16 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 66 | Spark SQL: Relational Data Processing in Spark | 2015 | SIGMOD | 0.00061639801 |
| 126 | Space-Efficient Online Computation of Quantile Summaries | 2001 | SIGMOD | 0.00044744986 |
| 192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858 |
| 199 | Declarative Data Cleaning: Language, Model, and Algorithms | 2001 | VLDB | 0.00035041015 |
| 475 | Mining Database Structure; Or, How to Build a Data Quality Browser | 2002 | SIGMOD | 0.00022303253 |
| 555 | Discovering Denial Constraints | 2013 | VLDB | 0.00020254908 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 833 | Guided Data Repair | 2011 | VLDB | 0.00016138432 |
| 894 | A Hybrid Approach to Functional Dependency Discovery | 2016 | SIGMOD | 0.00015556428 |
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905 |
| 1,683 | Cardinality Estimation: An Experimental Survey | 2018 | VLDB | 0.00010922679 |
| 2,269 | Ground: A Data Context Service | 2017 | CIDR | 9.147379e-05 |
| 2,463 | noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts | 2017 | VLDB | 8.7561396e-05 |
| 5,257 | Probabilistic Demand Forecasting at Scale | 2017 | VLDB | 5.6003925e-05 |
| 5,929 | ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning | 2016 | SIGMOD | 5.2682177e-05 |
Semantically Similar Papers