Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
Summary: mlinspect is a library that extracts a DAG representation of Python ML preprocessing pipelines, enabling lightweight, lineage-based inspection of data issues that affect reliability, accountability, and fairness. It automatically instruments declarative abstractions (estimator/transformer pipelines) through lightweight annotation propagation, with no manual code instrumentation, so pipelines can be inspected end to end in native ML stacks. (summarized by gpt-5-mini on Feb 09 2026)
Incoming Citations (Sorted by Pagerank)
Showing 6 of 6 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,407 | End-to-end Optimization of Machine Learning Prediction Queries | 2022 | SIGMOD | 7.1295646e-05 |
| 4,734 | MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines | 2021 | SIGMOD | 5.9615384e-05 |
| 4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05 |
| 8,840 | The Cost of Representation by Subset Repairs | 2025 | VLDB | 4.4388652e-05 |
| 11,103 | LucidScript: Bottom-up Standardization for Data Preparation | 2024 | VLDB | 4.1945683e-05 |
| 11,310 | Screening Native ML Pipelines with “ArgusEyes” | 2022 | CIDR | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 11 of 11 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 31 | Provenance Semirings | 2007 | PODS | 0.0007857786 |
| 1,404 | Responsible Data Management | 2020 | VLDB | 0.00012174977 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 1,750 | Weld: A Common Runtime for High Performance Data Analytics | 2017 | CIDR | 0.00010683647 |
| 2,028 | Putting Lipstick on Pig: Enabling Database-style Workflow Provenance | 2012 | VLDB | 9.7433981e-05 |
| 2,152 | MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis | 2018 | SIGMOD | 9.4239787e-05 |
| 2,443 | Data Management for Data Science: Towards Embedded Analytics | 2020 | CIDR | 8.8078476e-05 |
| 2,463 | noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts | 2017 | VLDB | 8.7561396e-05 |
| 4,426 | Data Debugging and Exploration with Vizier | 2019 | SIGMOD | 6.1969994e-05 |
| 5,341 | Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows | 2011 | SIGMOD | 5.5607484e-05 |
| 5,684 | Dagger: A Data (not code) Debugger | 2020 | CIDR | 5.3720749e-05 |