Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes
Summary: An unsupervised, corpus-driven approach that infers data-domain patterns from data lakes to validate data automatically, reducing false positives on string-valued data. An evaluation on a production data lake shows improved detection of data-quality issues compared with prior methods; part of the technology ships as the Auto-Tag feature in Azure Purview.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 6141
- Venue: SIGMOD
- Year: 2021
- Pagerank: 4.6377995e-05
- Overall Rank: 7,838 | 45.48%
- DOI: 10.1145/3448016.3457250
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 6 of 6 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 21 of 21 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 22 | SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets | 2008 | VLDB | 0.0008456613 |
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 224 | CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies | 2004 | SIGMOD | 0.00032746205 |
| 475 | Mining Database Structure; Or, How to Build a Data Quality Browser | 2002 | SIGMOD | 0.00022303253 |
| 555 | Discovering Denial Constraints | 2013 | VLDB | 0.00020254908 |
| 732 | Discovering Data Quality Rules | 2008 | VLDB | 0.00017465093 |
| 894 | A Hybrid Approach to Functional Dependency Discovery | 2016 | SIGMOD | 0.00015556428 |
| 1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164 |
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 2,158 | Uni-Detect: A Unified Approach to Automated Error Detection in Tables | 2019 | SIGMOD | 9.4141354e-05 |
| 2,506 | Auto-Detect: Data-Driven Error Detection in Tables | 2018 | SIGMOD | 8.6335464e-05 |
| 2,574 | Discovery of Genuine Functional Dependencies from Relational Data with Missing Values | 2018 | VLDB | 8.5173637e-05 |
| 2,888 | Sato: Contextual Semantic Type Detection in Tables | 2020 | VLDB | 7.9594996e-05 |
| 2,968 | Raha: A Configuration-Free Error Detection System | 2019 | SIGMOD | 7.7985097e-05 |
| 3,141 | ClusterJoin: A Similarity Joins Framework using Map-Reduce | 2014 | VLDB | 7.4829448e-05 |
| 3,299 | SCODED: Statistical Constraint Oriented Data Error Detection | 2020 | SIGMOD | 7.2546659e-05 |
| 4,929 | Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux | 2010 | VLDB | 5.8217296e-05 |
| 5,205 | ANMAT: Automatic Knowledge Discovery and Error Detection through Pattern Functional Dependencies | 2019 | SIGMOD | 5.630869e-05 |
| 6,416 | Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code | 2018 | SIGMOD | 5.072267e-05 |
| 6,993 | Unit Testing Data with Deequ | 2019 | SIGMOD | 4.8693227e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,490 | Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph | 2023 | VLDB | 4.3341665e-05 |
| 3,491 | TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines | 2020 | SIGMOD | 7.0451276e-05 |
| 10,142 | AutoDDG: Automated Dataset Description Generation using Large Language Models | 2026 | SIGMOD | 4.1945683e-05 |
| 5,383 | Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search | 2021 | VLDB | 5.5393038e-05 |
| 10,598 | Auto-Prep: Holistic Prediction of Data Preparation Steps for Self-Service Business Intelligence | 2025 | VLDB | 4.1945683e-05 |
| 8,416 | Towards Building Autonomous Data Services on Azure | 2023 | SIGMOD | 4.5196199e-05 |
| 2,506 | Auto-Detect: Data-Driven Error Detection in Tables | 2018 | SIGMOD | 8.6335464e-05 |
| 5,096 | Auto-Transform: Learning-to-Transform by Patterns | 2020 | VLDB | 5.7011825e-05 |
| 10,512 | Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables | 2025 | SIGMOD | 4.1945683e-05 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |