Database Paper Browser

Automating Large-Scale Data Quality Verification

Summary: Automates data quality verification at scale through a declarative API that combines common quality constraints with user-defined validation code, enabling unit tests for data. The system translates constraint validation into aggregation queries on Apache Spark, supports incremental validation on growing datasets, and uses machine learning for constraint suggestions and anomaly detection. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11660
Venue: VLDB
Year: 2018
Pagerank: 0.00011725533
Overall Rank: 1,482 | 89.70%
DOI: 10.14778/3229863.3229867

Authors

Incoming Citations (Sorted by Pagerank)

Showing 27 of 27 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
1,940 | SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging | 2021 | SIGMOD | 0.00010020173
2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05
2,158 | Uni-Detect: A Unified Approach to Automated Error Detection in Tables | 2019 | SIGMOD | 9.4141354e-05
2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05
3,252 | Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks | 2020 | SIGMOD | 7.3178277e-05
3,491 | TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines | 2020 | SIGMOD | 7.0451276e-05
3,508 | spade: Synthesizing Data Quality Assertions for Large Language Model Pipelines | 2024 | VLDB | 7.0271496e-05
5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05
5,928 | SchemaPile: A Large Collection of Relational Database Schemas | 2024 | SIGMOD | 5.2685946e-05
6,291 | Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines | 2021 | CIDR | 5.1269764e-05
6,993 | Unit Testing Data with Deequ | 2019 | SIGMOD | 4.8693227e-05
7,838 | Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes | 2021 | SIGMOD | 4.6377995e-05
8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05
8,422 | Deducing Certain Fixes to Graphs | 2019 | VLDB | 4.5167705e-05
8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05
8,853 | Complaint-Driven Training Data Debugging at Interactive Speeds | 2022 | SIGMOD | 4.4350727e-05
8,915 | DQDF: Data-Quality-Aware Dataframes | 2022 | VLDB | 4.427232e-05
9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05
9,231 | Modyn: Data-Centric Machine Learning Pipeline Orchestration | 2025 | SIGMOD | 4.3690661e-05
9,856 | In-Database Data Imputation | 2024 | SIGMOD | 4.269353e-05
10,291 | Morphing-based Compression for Data-centric ML Pipelines | 2026 | VLDB | 4.1945683e-05
10,628 | CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines | 2025 | VLDB | 4.1945683e-05
10,867 | T-Assess: An Efficient Data Quality Assessment System Tailored for Trajectory Data | 2025 | VLDB | 4.1945683e-05
11,052 | Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines | 2024 | VLDB | 4.1945683e-05
11,280 | CM-Explorer: Dissecting Data Ingestion Problems | 2023 | VLDB | 4.1945683e-05
11,317 | Data Management Opportunities for Foundation Models | 2022 | CIDR | 4.1945683e-05
13,300 | DEEM 2019: Workshop on Data Management for End-to-End Machine Learning | 2019 | SIGMOD | -

Outgoing Citations (Sorted by Pagerank)

Showing 16 of 16 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

