Active Reinforcement Learning for Data Preparation: Learn2Clean with Human-In-The-Loop

Summary: Introduces Learn2Clean, a human-in-the-loop active reinforcement learning method that incrementally explores and prunes the combinatorial space of data-cleaning and preprocessing pipelines, querying users to guide its actions for a target ML task. Frames data preparation as AI-hard and uses human feedback to trade off evaluation cost against downstream model quality, in contrast to passive AutoML and bandit approaches. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 380
Venue: CIDR
Year: 2020
Pagerank: 4.1905499e-05
Overall Rank: 11,553 | 19.71%
DOI: -

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Authors

1. Laure Berti-Equille

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 1 of 1 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
917	Democratizing Data Science through Interactive Curation of ML Pipelines	2019	SIGMOD	0.00015324193

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
5,965	Automatic Data Acquisition for Deep Learning	2021	VLDB	5.2476363e-05
1,629	Data Cleaning: Overview and Emerging Challenges	2016	SIGMOD	0.00011073148
10,816	DemandClean: A Multi-Objective Learning Framework for Balancing Model Tolerance to Data Authenticity and Diversity	2025	VLDB	4.1905499e-05
6,188	Semi-Supervised Data Cleaning with Raha and Baran	2021	CIDR	5.1607275e-05
5,389	Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search	2021	VLDB	5.5339832e-05
8,739	CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning	2024	SIGMOD	4.4520434e-05
13,245	Data Cleaning in the Era of Data Science: Challenges and Opportunities	2021	CIDR	-
5,930	ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning	2016	SIGMOD	5.2632185e-05
8,828	HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation	2023	SIGMOD	4.4364918e-05
917	Democratizing Data Science through Interactive Curation of ML Pipelines	2019	SIGMOD	0.00015324193