How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning

Cited by: 25
Authors
Camilo Corrales, David [1, 2]
Carlos Corrales, Juan [1]
Ledezma, Agapito [2]
Affiliations
[1] Univ Cauca, Grp Ingn Telemat, Campus Tulcan, Popayan 190002, Colombia
[2] Univ Carlos III Madrid, Dept Ciencias Computac & Ingn, Ave Univ 30, Leganes 28911, Spain
Source
SYMMETRY-BASEL | 2018 / Vol. 10 / Issue 04
Keywords
data cleaning in regression models (DC-RM); data quality issue; data cleaning task; regression model; INFORMATION GAIN; RECORD DATA; FRAMEWORK; SELECTION; IMPUTATION;
DOI
10.3390/sym10040099
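Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification code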
07 ; 0710 ; 09 ;
Abstract
Today, data availability has gone from scarce to superabundant. Technologies such as the IoT, trends in social media, and the capabilities of smartphones are producing and digitizing large amounts of data that were previously unavailable. This massive increase in data creates opportunities for new business models, but it also demands new data quality techniques and methods in knowledge discovery, especially when the data come from different sources (e.g., sensors, social networks, cameras). The data quality process applied to a dataset supports conclusions about the information it contains, and this process is increasingly carried out with the aid of data cleaning approaches. Therefore, guaranteeing high data quality is considered the primary goal of the data scientist. In this paper, we propose a process for data cleaning in regression models (DC-RM). The proposed data cleaning process is evaluated on real datasets from the UCI Repository of Machine Learning Databases. To assess the data cleaning process, the datasets cleaned by DC-RM were used to train the same regression models proposed by the authors of the UCI datasets. The results achieved by the models trained on the datasets produced by DC-RM are better than or equal to those reported by the datasets' authors.
Pages: 20