A case-based reasoning system for recommendation of data cleaning algorithms in classification and regression tasks

Cited by: 27

Authors:

Camilo Corrales, David [1,2]

Ledezma, Agapito [1]

Carlos Corrales, Juan [2]

Affiliations:

[1] Univ Carlos III Madrid, Dept Informat, Madrid 28911, Spain

[2] Univ Cauca, Grp Ingn Telemat, Sector Tulcan, Popayan, Colombia

Source:

APPLIED SOFT COMPUTING | 2020 / Volume 90

Keywords:

Case-based reasoning; Classification; Regression; CONCEPTUAL-FRAMEWORK; KNOWLEDGE DISCOVERY; SUPPORT; SIMILARITY; SELECTION; CBR;

DOI:

10.1016/j.asoc.2020.106180

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Advances in information technologies (social networks, mobile applications, the Internet of Things, etc.) generate a deluge of digital data, but converting these data into useful information for business decisions is a growing challenge. Exploiting this massive amount of data through the knowledge discovery (KD) process involves identifying valid, novel, potentially useful, and understandable patterns in a huge volume of data. However, preparing the data is a non-trivial refinement task that requires technical expertise in data cleaning methods and algorithms. Consequently, choosing a suitable data analysis technique is difficult for inexpert users. To address these problems, we propose a case-based reasoning (CBR) system that recommends data cleaning algorithms for classification and regression tasks. In our approach, the problem space is represented by the meta-features of the dataset, its attributes, and the target variable. The solution space contains the data cleaning algorithms used for each dataset. Cases are represented through a Data Cleaning Ontology. The case retrieval mechanism consists of a filter phase and a similarity phase. In the first phase, we define two filter approaches, based on clustering and quartile analysis, which retrieve a reduced number of relevant cases. The second phase scores the similarity between a new case and the cases retrieved by the filters and ranks them accordingly. The proposed retrieval mechanism was evaluated by a panel of judges, who scored the similarity between a query case and every case in the case base (ground truth). Measured against the judges' ranking, the retrieval mechanism reaches an average precision of 94.5% in the top 3 (P@3), 84.55% in the top 7 (P@7), and 78.35% in the top 10 (P@10). © 2020 Elsevier B.V. All rights reserved.
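The two-phase retrieval and the P@k measure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the filters or the similarity function in detail, so everything here is an assumption made for illustration: the filter keeps cases that fall in the same quartile as the query on a single meta-feature, similarity is an inverse Euclidean distance over meta-features already scaled to [0, 1], and all names (`quartile_bin`, `retrieve`, `precision_at_k`, `size`, `missing`) and the toy case base are hypothetical rather than taken from the paper.

```python
from math import sqrt

def quartile_bin(value, sorted_values):
    """Return 0..3: the quartile bin of `value` relative to the case base."""
    n = len(sorted_values)
    cuts = [sorted_values[(n * q) // 4] for q in (1, 2, 3)]
    return sum(value > c for c in cuts)

def retrieve(query, case_base, filter_key, top_k=3):
    """Two-phase retrieval: quartile filter, then similarity ranking."""
    # Phase 1 (filter): keep cases in the query's quartile on one meta-feature.
    values = sorted(c["meta"][filter_key] for c in case_base)
    q_bin = quartile_bin(query[filter_key], values)
    candidates = [c for c in case_base
                  if quartile_bin(c["meta"][filter_key], values) == q_bin]

    # Phase 2 (similarity): assumed inverse-Euclidean similarity in (0, 1].
    def similarity(case):
        d = sqrt(sum((query[k] - case["meta"][k]) ** 2 for k in query))
        return 1.0 / (1.0 + d)

    return sorted(candidates, key=similarity, reverse=True)[:top_k]

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved cases that the judges marked relevant."""
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / k

# Hypothetical case base: meta-features scaled to [0, 1].
case_base = [
    {"id": 1, "meta": {"size": 0.10, "missing": 0.05}},
    {"id": 2, "meta": {"size": 0.20, "missing": 0.10}},
    {"id": 3, "meta": {"size": 0.30, "missing": 0.40}},
    {"id": 4, "meta": {"size": 0.40, "missing": 0.30}},
    {"id": 5, "meta": {"size": 0.50, "missing": 0.20}},
    {"id": 6, "meta": {"size": 0.60, "missing": 0.50}},
    {"id": 7, "meta": {"size": 0.70, "missing": 0.10}},
    {"id": 8, "meta": {"size": 0.80, "missing": 0.60}},
]
query = {"size": 0.45, "missing": 0.28}

ranked_ids = [c["id"] for c in retrieve(query, case_base, "size")]
print(ranked_ids)
```

In this toy setting the filter narrows eight cases down to the two that share the query's quartile on `size`, and only those are scored for similarity. A clustering-based filter, the other approach the abstract mentions, would replace `quartile_bin` with a cluster assignment while leaving the ranking phase unchanged.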

Pages: 13
