A case-based reasoning system for recommendation of data cleaning algorithms in classification and regression tasks

Cited by: 27
Authors
Camilo Corrales, David [1, 2]
Ledezma, Agapito [1]
Carlos Corrales, Juan [2]
Affiliations
[1] Univ Carlos III Madrid, Dept Informat, Madrid 28911, Spain
[2] Univ Cauca, Grp Ingn Telemat, Sector Tulcan, Popayan, Colombia
Keywords
Case-based reasoning; Classification; Regression; CONCEPTUAL-FRAMEWORK; KNOWLEDGE DISCOVERY; SUPPORT; SIMILARITY; SELECTION; CBR;
DOI
10.1016/j.asoc.2020.106180
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advances in information technologies (social networks, mobile applications, Internet of Things, etc.) generate a deluge of digital data, but converting these data into useful information for business decisions is a growing challenge. Exploiting this massive amount of data through the knowledge discovery (KD) process involves identifying valid, novel, potentially useful, and understandable patterns in a huge volume of data. However, preparing the data is a non-trivial refinement task that requires technical expertise in data cleaning methods and algorithms. Consequently, choosing a suitable data analysis technique is difficult for non-expert users. To address these problems, we propose a case-based reasoning (CBR) system that recommends data cleaning algorithms for classification and regression tasks. In our approach, the problem space is represented by the meta-features of the dataset, its attributes, and the target variable. The solution space contains the data cleaning algorithms used for each dataset. We represent the cases through a Data Cleaning Ontology. The case retrieval mechanism consists of a filter phase and a similarity phase. In the first phase, we define two filter approaches, based on clustering and on quartile analysis, which retrieve a reduced number of relevant cases. The second phase scores the similarity between a new case and the cases retrieved by the filters, and ranks the retrieved cases accordingly. The proposed retrieval mechanism was evaluated by a panel of judges, who scored the similarity between a query case and all cases in the case base (ground truth). Measured against the judges' ranking, the retrieval mechanism reaches an average precision of 94.5% in the top 3 (P@3), 84.55% in the top 7 (P@7), and 78.35% in the top 10 (P@10). (C) 2020 Elsevier B.V. All rights reserved.
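The two-phase retrieval and the P@k metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the filter rules or the similarity measure, so the quartile rule, the Euclidean-based similarity, the meta-feature names (`missing_ratio`, `outlier_ratio`), and the toy case base below are all assumptions made for illustration.

```python
import math

# Hypothetical case base: meta-features (problem) plus cleaning algorithms (solution).
# Feature values are assumed to be already scaled to [0, 1].
CASE_BASE = [
    {"id": "c1", "features": {"missing_ratio": 0.01, "outlier_ratio": 0.10}, "solution": ["none"]},
    {"id": "c2", "features": {"missing_ratio": 0.02, "outlier_ratio": 0.40}, "solution": ["outlier removal"]},
    {"id": "c3", "features": {"missing_ratio": 0.10, "outlier_ratio": 0.25}, "solution": ["imputation"]},
    {"id": "c4", "features": {"missing_ratio": 0.12, "outlier_ratio": 0.60}, "solution": ["imputation", "outlier removal"]},
    {"id": "c5", "features": {"missing_ratio": 0.30, "outlier_ratio": 0.05}, "solution": ["imputation"]},
    {"id": "c6", "features": {"missing_ratio": 0.35, "outlier_ratio": 0.20}, "solution": ["imputation"]},
    {"id": "c7", "features": {"missing_ratio": 0.50, "outlier_ratio": 0.30}, "solution": ["attribute removal"]},
    {"id": "c8", "features": {"missing_ratio": 0.55, "outlier_ratio": 0.70}, "solution": ["attribute removal"]},
]


def quartile_index(value, values):
    """Return 0..3: the quartile bin of value, using simple rank-based cut points."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[min(n - 1, (n * q) // 4)] for q in (1, 2, 3)]
    return sum(value >= cut for cut in cuts)


def filter_cases(query, cases, key):
    """Phase 1 (assumed rule): keep cases in the same quartile as the query for one meta-feature."""
    values = [case["features"][key] for case in cases]
    target = quartile_index(query[key], values)
    return [c for c in cases if quartile_index(c["features"][key], values) == target]


def similarity(a, b):
    """Phase 2 (assumed measure): map Euclidean distance to a score in (0, 1]."""
    distance = math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))
    return 1.0 / (1.0 + distance)


def retrieve(query, cases, key, k):
    """Filter the case base, then rank the surviving cases by similarity to the query."""
    candidates = filter_cases(query, cases, key)
    ranked = sorted(candidates, key=lambda c: similarity(query, c["features"]), reverse=True)
    return ranked[:k]


def precision_at_k(ranked_ids, relevant_ids, k):
    """P@k: fraction of the top-k retrieved cases that the ground truth marks as relevant."""
    return sum(1 for case_id in ranked_ids[:k] if case_id in relevant_ids) / k


query = {"missing_ratio": 0.11, "outlier_ratio": 0.20}
top_cases = retrieve(query, CASE_BASE, key="missing_ratio", k=3)
print([case["id"] for case in top_cases])
```

In this sketch the filter narrows eight cases down to the two that share the query's quartile for `missing_ratio`, and only those are scored. In the evaluation the abstract describes, the set of relevant cases passed to `precision_at_k` would come from the judges' similarity scores.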
Pages: 13
Related papers
63 in total
[11]   KNIME: The Konstanz Information Miner [J].
Berthold, Michael R. ;
Cebron, Nicolas ;
Dill, Fabian ;
Gabriel, Thomas R. ;
Koetter, Tobias ;
Meinl, Thorsten ;
Ohl, Peter ;
Sieb, Christoph ;
Thiel, Kilian ;
Wiswedel, Bernd .
DATA ANALYSIS, MACHINE LEARNING AND APPLICATIONS, 2008, :319-326
[12]   Towards Intelligent Data Analysis: The Metadata Challenge [J].
Bilalli, Besim ;
Abello, Alberto ;
Aluja-Banet, Tomas ;
Wrembel, Robert .
IOTBD: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INTERNET OF THINGS AND BIG DATA, 2016, :331-338
[13]   A conceptual framework and belief-function approach to assessing overall information quality [J].
Bovee, M ;
Srivastava, RP ;
Mak, B .
INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2003, 18 (01) :51-74
[14]   A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal [J].
Camilo Corrales, David ;
Ledezma, Agapito ;
Carlos Corrales, Juan .
JOURNAL OF COMPUTERS, 2015, 10 (06) :396-405
[15]   A new dataset for coffee rust detection in Colombian crops base on classifiers [J].
Camilo Corrales, David ;
Ledezma, Agapito ;
Pena Q., Andres J. ;
Hoyos, Javier ;
Figueroa, Apolinar ;
Carlos Corrales, Juan .
SISTEMAS & TELEMATICA, 2014, 12 (29) :9-23
[16]  
Castiello C, 2005, LECT NOTES ARTIF INT, V3558, P457
[17]  
Charest M., 2006, Artificial Intelligence and Soft Computing, P9
[18]   Bridging the gap between data mining and decision support: A case-based reasoning and ontology approach [J].
Charest, Michel ;
Delisle, Sylvain ;
Cervantes, Ofelia ;
Shen, Yanfen .
INTELLIGENT DATA ANALYSIS, 2008, 12 (02) :211-236
[19]   Intelligent data mining assistance via CBR and ontologies [J].
Charest, Michel ;
Delisle, Sylvain ;
Cervantes, Ofelia ;
Shen, Yanfen .
SEVENTEENTH INTERNATIONAL CONFERENCE ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2006, :593-+
[20]  
Choinski Marcin, 2009, Proceedings of the 2009 International Multiconference on Computer Science and Information Technology (IMCSIT), P147, DOI 10.1109/IMCSIT.2009.5352735