The data complexity index to construct an efficient cross-validation method

Cited by: 20
Authors
Li, Der-Chiang [1 ]
Fang, Yao-Hwei [2 ]
Fang, Y. M. Frank [3 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Ind & Informat Management, Tainan 70101, Taiwan
[2] Natl Hlth Res Inst, Div Biostat & Bioinformat, Miaoli, Taiwan
[3] Feng Chia Univ, Geog Informat Syst Res Ctr, Dept Civil & Hydraul Engn, Taichung, Taiwan
Keywords
Binary classification problem; Cross-validation; Data complexity; Noisy data; Classification; Algorithm
DOI
10.1016/j.dss.2010.07.005
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Cross-validation is a widely used model evaluation method in data mining applications. However, considerable effort is usually required to determine appropriate parameter values, such as the training data size and the number of experiment runs, to implement a validated evaluation. This study develops an efficient cross-validation method, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. CBE cross-validation establishes a complexity index, the CBE index, by exploring the geometric structure and noise of the data. The CBE index is then used to calculate the optimal training data size and number of experiment runs, reducing model evaluation time on computationally expensive classification data sets. One simulated and three real data sets are employed to validate the performance of the proposed method, with repeated random sub-sampling validation and K-fold cross-validation as the comparison methods. The results show that the three methods achieve similar validation performance, but the training time required by CBE cross-validation is indeed lower than that of the other two. (C) 2010 Elsevier B.V. All rights reserved.
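The two baseline schemes the paper compares against differ in exactly the parameters the CBE index is meant to choose: K-fold cross-validation fixes the split structure once K is set, whereas repeated random sub-sampling leaves both the training fraction and the number of runs open. A minimal pure-Python sketch of the two index-generation schemes (function names and the `train_frac`/`runs` parameters are illustrative, not from the paper):

```python
import random

def k_fold_indices(n, k):
    """K-fold cross-validation: partition n samples into k disjoint
    folds; each fold serves once as the test set while the remaining
    k-1 folds form the training set."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def repeated_subsampling_indices(n, runs, train_frac, seed=0):
    """Repeated random sub-sampling validation: each run draws a fresh
    random train/test split, so the training size (train_frac) and the
    number of runs are free parameters -- the quantities the CBE index
    is designed to set for a given data set."""
    rng = random.Random(seed)
    idx = list(range(n))
    for _ in range(runs):
        rng.shuffle(idx)
        cut = int(train_frac * n)
        yield idx[:cut], idx[cut:]
```

With K-fold, every sample appears in exactly one test fold, so K fully determines the evaluation cost; with sub-sampling, cost scales with `runs`, which is why choosing it well matters for expensive classifiers.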
Pages: 93-102
Page count: 10