RLclean: An unsupervised integrated data cleaning framework based on deep reinforcement learning

Cited by: 2

Authors:

Peng, Jinfeng [1]

Shen, Derong [1]

Nie, Tiezheng [1]

Kou, Yue [1]

Affiliation:

[1] Sch Northeastern Univ, Coll Comp Sci & Engn, Shenyang, Peoples R China

Source:

INFORMATION SCIENCES | 2024 / Volume 682

Funding:

National Natural Science Foundation of China;

Keywords:

Error detection; Data repair; Deep reinforcement learning; ERROR-DETECTION; REPRESENTATION; ALGORITHM;

DOI:

10.1016/j.ins.2024.121281

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Data cleaning, a prerequisite to subsequent data analysis, has long been a focus of data science research. Datasets with errors can severely degrade the quality of downstream analytical results. Unfortunately, despite the proliferation of data cleaning methods, cleaning remains time-consuming and frequently entails considerable labor costs. In practice, errors are often heterogeneous and require different solutions. As a result, stand-alone methods are often inadequate for dirty data containing multiple types of errors, while studies have shown that combining such methods always requires human intervention and still yields unsatisfactory results. In this paper, we propose an unsupervised integrated data cleaning framework, namely RLclean. Based on deep reinforcement learning, RLclean takes advantage of multiple data cleaning techniques, enabling it to effectively clean multiple types of errors and achieve satisfactory results. Additionally, it eliminates the need for costly human involvement, because the cleaning strategy is learned in a data-driven manner, which also allows the framework to adapt itself to diverse domains. RLclean mainly consists of two parts: (i) an integrated error detection model that unites multiple techniques to detect different types of errors from multiple views; and (ii) an integrated data repair model that learns the optimal repair operations and repairs dirty data in an unsupervised manner. Extensive experiments on benchmark datasets demonstrate the superiority of RLclean over state-of-the-art methods.

Pages: 15
