Data X-Ray: A Diagnostic Tool for Data Errors
Summary: Data X-Ray reframes cleaning as diagnosing errors in data generation, not purging them. A Bayesian cost model guides diagnostics; an efficient parallel algorithm scales to large datasets, delivering better diagnoses and large speedups over prior methods.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 5081
- Venue: SIGMOD
- Year: 2015
- Pagerank: 7.5568954e-05
- Overall Rank: 3,105 | 78.41%
- DOI: 10.1145/2723372.2750549
Incoming Citations (Sorted by Pagerank)
Showing 22 of 22 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 2,126 | MacroBase: Prioritizing Attention in Fast Data | 2017 | SIGMOD | 9.4887794e-05 |
| 2,154 | DIFF: A Relational Interface for Large-Scale Data Explanation | 2019 | VLDB | 9.4208667e-05 |
| 2,460 | Combining Quantitative and Logical Data Cleaning | 2016 | VLDB | 8.7617484e-05 |
| 2,753 | Complaint-driven Training Data Debugging for Query 2.0 | 2020 | SIGMOD | 8.1724339e-05 |
| 3,299 | SCODED: Statistical Constraint Oriented Data Error Detection | 2020 | SIGMOD | 7.2546659e-05 |
| 4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05 |
| 5,445 | QFix: Diagnosing Errors through Query Histories | 2017 | SIGMOD | 5.5020909e-05 |
| 6,475 | Explain3D: Explaining Disagreements in Disjoint Datasets | 2019 | VLDB | 5.0497183e-05 |
| 6,696 | Approximate Summaries for Why and Why-not Provenance | 2020 | VLDB | 4.9581958e-05 |
| 6,779 | Explaining Inference Queries with Bayesian Optimization | 2021 | VLDB | 4.9280116e-05 |
| 6,817 | Error Diagnosis and Data Profiling with Data X-Ray | 2015 | VLDB | 4.9171711e-05 |
| 6,944 | DataPrism: Exposing Disconnect between Data and Systems | 2022 | SIGMOD | 4.8912787e-05 |
| 8,341 | BugDoc: Algorithms to Debug Computational Processes | 2020 | SIGMOD | 4.5433282e-05 |
| 8,743 | CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning | 2024 | SIGMOD | 4.456315e-05 |
| 8,853 | Complaint-Driven Training Data Debugging at Interactive Speeds | 2022 | SIGMOD | 4.4350727e-05 |
| 9,024 | Causality-Guided Adaptive Interventional Debugging | 2020 | SIGMOD | 4.4075011e-05 |
| 9,220 | BugDoc: A System for Debugging Computational Pipelines | 2020 | SIGMOD | 4.3702188e-05 |
| 9,533 | TSExplain: Surfacing Evolving Explanations for Time Series | 2021 | SIGMOD | 4.3269636e-05 |
| 10,269 | Database Views as Explanations for Relational Deep Learning | 2026 | VLDB | 4.1945683e-05 |
| 10,875 | SDEcho: Efficient Explanation of Aggregated Sequence Difference | 2025 | VLDB | 4.1945683e-05 |
| 11,837 | QFix: Demonstrating Error Diagnosis in Query Histories | 2016 | SIGMOD | 4.1945683e-05 |
| 11,906 | Knowledge Curation and Knowledge Fusion: Challenges, Models, and Applications | 2015 | SIGMOD | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 29 of 29 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 31 | Provenance Semirings | 2007 | PODS | 0.0007857786 |
| 37 | Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud | 2012 | VLDB | 0.0007522744 |
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 214 | Scorpion: Explaining Away Outliers in Aggregate Queries | 2013 | VLDB | 0.0003363692 |
| 322 | Record Linkage: Similarity Measures and Algorithms | 2006 | SIGMOD | 0.00027518768 |
| 371 | A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration | 2012 | VLDB | 0.00025389696 |
| 623 | Improving Data Quality: Consistency and Accuracy | 2007 | VLDB | 0.00018996374 |
| 691 | AJAX: An Extensible Data Cleaning Tool | 2000 | SIGMOD | 0.00018086135 |
| 833 | Guided Data Repair | 2011 | VLDB | 0.00016138432 |
| 923 | Provenance and Scientific Workflows: Challenges and Opportunities | 2008 | SIGMOD | 0.0001527609 |
| 942 | A Formal Approach to Finding Explanations for Database Queries | 2014 | SIGMOD | 0.00015155714 |
| 1,099 | Interpretable and Informative Explanations of Outcomes | 2015 | VLDB | 0.00014096312 |
| 1,119 | The Complexity of Causality and Responsibility for Query Answers and non-Answers | 2011 | VLDB | 0.0001386199 |
| 1,188 | On Generating Near-Optimal Tableaux for Conditional Functional Dependencies | 2008 | VLDB | 0.00013441729 |
| 1,534 | PerfXplain: Debugging MapReduce Job Performance | 2012 | VLDB | 0.00011468393 |
| 1,624 | Sampling the Repairs of Functional Dependency Violations under Hard Constraints | 2010 | VLDB | 0.00011099222 |
| 2,028 | Putting Lipstick on Pig: Enabling Database-style Workflow Provenance | 2012 | VLDB | 9.7433981e-05 |
| 2,379 | A Revival of Integrity Constraints for Data Cleaning | 2008 | VLDB | 8.9392633e-05 |
| 2,402 | Causality and Explanations in Databases | 2014 | VLDB | 8.8928361e-05 |
| 2,420 | From Data Fusion to Knowledge Fusion | 2014 | VLDB | 8.8530994e-05 |
| 2,452 | Data Fusion – Resolving Data Conflicts for Integration | 2009 | VLDB | 8.7839322e-05 |
| 2,602 | Tracing Data Errors with View-Conditioned Causality | 2011 | SIGMOD | 8.4667197e-05 |
| 2,852 | MRI: Meaningful Interpretations of Collaborative Ratings | 2011 | VLDB | 8.0151391e-05 |
| 3,242 | Explanation-Based Auditing | 2012 | VLDB | 7.3301779e-05 |
| 4,383 | Incremental Record Linkage | 2014 | VLDB | 6.2383094e-05 |
| 4,929 | Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux | 2010 | VLDB | 5.8217296e-05 |
| 6,606 | Explainable Security for Relational Databases | 2014 | SIGMOD | 4.996456e-05 |
| 6,744 | MapRat: Meaningful Explanation, Interactive Exploration and Geo-Visualization of Collaborative Ratings | 2012 | VLDB | 4.9419773e-05 |
| 7,280 | I4E: Interactive Investigation of Iterative Information Extraction | 2010 | SIGMOD | 4.778826e-05 |
Semantically Similar Papers