Efficient Utilization of Missing Data in Cost-Sensitive Learning

Cited by: 55
Authors
Zhu, Xiaofeng [1]
Yang, Jianye [2]
Zhang, Chengyuan [3]
Zhang, Shichao [3]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
[2] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China
[3] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Hunan, Peoples R China
Keywords
Data models; Analytical models; Machine learning; Decision trees; Machine learning algorithms; Knowledge discovery; Computer science; Missing data imputation; cost-sensitive learning; decision tree; classification; imputation order; C4.5 algorithm; imputation cost; IMPUTATION; ALGORITHM
DOI
10.1109/TKDE.2019.2956530
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Unlike previous imputation methods, which impute missing values in incomplete samples using only the information in the complete samples, this paper proposes a Data-driven Incremental imputation Model (DIM for short), which uses all available information in the data set to impute missing values in an economical, effective, ordered, and iterative manner. To this end, we propose a scoring rule that ranks the missing features by taking into account both an economical criterion and the effective imputation information. The economical criterion considers both the imputation cost and the discriminative ability of each feature, while the effective imputation information makes it possible to use all observed information in the data set, including previously imputed values, to impute the remaining missing values. During the imputation process, DIM first detects the need-not-impute samples to reduce imputation cost and noise, and then imputes the top-ranked missing features first. The process imputes the missing features in order until all missing values are filled or the imputation cost budget is exhausted. Experimental results on UCI data sets demonstrate the advantages of the proposed DIM over competing methods in terms of both prediction accuracy and classification accuracy.
Pages: 2425-2436
Number of pages: 12
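
The abstract above describes an iterative, cost-aware imputation loop: rank the still-missing features by a score that trades off imputation cost against discriminative ability and the amount of available evidence, then fill features in rank order until every missing value is imputed or the cost budget is exhausted. The following is a minimal Python sketch of such a loop, assuming per-feature costs and relevance scores are given; the scoring formula, the mean-based estimator, and all names are illustrative assumptions and not the authors' DIM algorithm.

```python
# Hypothetical sketch of a cost-aware incremental imputation loop in the spirit
# of the abstract; not the authors' DIM. Assumes each feature has at least one
# observed value, and that per-feature costs and relevance scores are supplied.
import numpy as np


def incremental_impute(X, costs, relevance, budget):
    """Fill NaNs in X one feature at a time until done or the budget is spent.

    X         : (n_samples, n_features) float array, NaN marks a missing value.
    costs     : (n_features,) per-feature imputation cost (assumed given).
    relevance : (n_features,) discriminative score per feature, e.g. mutual
                information with the class label (assumed precomputed).
    budget    : total imputation cost allowed.
    """
    X = np.asarray(X, dtype=float).copy()
    costs = np.asarray(costs, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    spent = 0.0

    while np.isnan(X).any():
        # Rank the still-missing features: prefer cheap, discriminative
        # features for which plenty of observed evidence already exists.
        cols = np.where(np.isnan(X).any(axis=0))[0]
        observed_frac = 1.0 - np.isnan(X[:, cols]).mean(axis=0)
        scores = relevance[cols] * observed_frac / (costs[cols] + 1e-9)
        j = cols[np.argmax(scores)]

        if spent + costs[j] > budget:
            break  # imputation cost budget exhausted

        # Placeholder estimator: fill with the mean of the observed entries.
        # Once filled, the feature no longer counts as missing in later rounds.
        col = X[:, j]
        col[np.isnan(col)] = np.nanmean(col)
        spent += costs[j]

    return X


if __name__ == "__main__":
    # Small demo on synthetic data with roughly 30% missing entries.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))
    X[rng.random(X.shape) < 0.3] = np.nan
    X_filled = incremental_impute(
        X,
        costs=np.array([1.0, 2.0, 0.5]),
        relevance=np.array([0.8, 0.3, 0.6]),
        budget=3.0,
    )
    print(X_filled)
```

In this sketch the budget check happens before each feature is filled, so a partially spent budget simply leaves the lowest-ranked features unimputed, mirroring the "until the imputation cost is exhausted" stopping rule stated in the abstract.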