SVOIS: Support Vector Oriented Instance Selection for text classification

Cited by: 17
Authors
Tsai, Chih-Fong [1]
Chang, Che-Wei [1]
Affiliation
[1] Natl Cent Univ, Dept Informat Management, Chungli, Taiwan
Keywords
Instance selection; Data reduction; Text classification; Machine learning; Support vector machines; PATTERN-RECOGNITION; ALGORITHMS; REDUCTION;
DOI
10.1016/j.is.2013.05.001
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Automatic text classification is usually based on models constructed by learning from training examples. However, as text document repositories grow rapidly, the storage requirements and computational cost of model learning are becoming ever higher. Instance selection is one way to overcome this limitation. The aim is to reduce the amount of data by filtering out noisy data from a given training dataset. A number of instance selection algorithms have been proposed in the literature, such as ENN, IB3, ICF, and DROP3. However, all of these methods were developed for the k-nearest neighbor (k-NN) classifier. In addition, their performance has not been examined in the text classification domain, where the dimensionality of the dataset is usually very high. Support vector machines (SVMs) are a core text classification technique. In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed. First, a regression plane in the original feature space is identified by applying a threshold distance between the given training instances and their class centers. Then, another threshold distance, between the identified data (forming the regression plane) and the regression plane, is used to decide on the support vectors for the selected instances. The experimental results on the TechTC-100 dataset show the superior performance of SVOIS over other state-of-the-art algorithms. In particular, using SVOIS to select text documents allows the k-NN and SVM classifiers to perform better than without instance selection. (C) 2013 Elsevier Ltd. All rights reserved.
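The two-threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the thresholds are set, which side of the first threshold is kept, or how the regression plane is fitted. The sketch assumes two classes, quantile-based thresholds, keeping the instances farther from their own class center, and an ordinary least-squares plane. The function name `svois_sketch` and its parameters are hypothetical.

```python
import numpy as np

def svois_sketch(X, y, center_quantile=0.5, plane_quantile=0.5):
    """Illustrative two-threshold instance selection (assumptions, not the paper's exact method)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles two classes only")

    # Step 1: distance of every instance to the center of its own class.
    dist_center = np.empty(len(X))
    for c in classes:
        mask = y == c
        center = X[mask].mean(axis=0)
        dist_center[mask] = np.linalg.norm(X[mask] - center, axis=1)

    # Assumption: instances at or beyond a per-class quantile threshold
    # are treated as the candidates that form the regression plane.
    candidates = np.zeros(len(X), dtype=bool)
    for c in classes:
        mask = y == c
        threshold = np.quantile(dist_center[mask], center_quantile)
        candidates |= mask & (dist_center >= threshold)

    # Step 2 (assumption): fit the plane w.x + b by least squares,
    # regressing -1/+1 class targets on the candidate instances.
    targets = np.where(y[candidates] == classes[0], -1.0, 1.0)
    design = np.hstack([X[candidates], np.ones((candidates.sum(), 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    w, b = coef[:-1], coef[-1]

    # Step 3: keep candidates whose distance to the plane is within
    # a second quantile threshold.
    dist_plane = np.abs(X @ w + b) / np.linalg.norm(w)
    plane_threshold = np.quantile(dist_plane[candidates], plane_quantile)
    selected = candidates & (dist_plane <= plane_threshold)
    return np.flatnonzero(selected)
```

The returned indices identify the reduced training set, which could then be passed to a k-NN or SVM classifier in place of the full data.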
Pages: 1070-1083
Page count: 14