An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization

Cited by: 115
Authors
Lee, Lam Hong [1]
Wan, Chin Heng [1]
Rajkumar, Rajprasad [2]
Isa, Dino [2]
Affiliations
[1] Univ Tunku Abdul Rahman, Fac Informat & Commun Technol, Kampar 31900, Perak, Malaysia
[2] Univ Nottingham, Fac Engn, Intelligent Syst Res Grp, Semenyih 43500, Selangor, Malaysia
Keywords
Text document classification; Support Vector Machine; Euclidean distance function; Kernel function; Soft margin parameter; KERNEL PARAMETERS; LEARNING-METHODS
DOI
10.1007/s10489-011-0314-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a new text document classification framework, termed Euclidean-SVM, that uses the Support Vector Machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase. The SVM constructs a classifier by generating a decision surface, namely the optimal separating hyper-plane, to partition different categories of data points in the vector space. The concept of the optimal separating hyper-plane can be generalized to non-linearly separable cases by introducing kernel functions that map the data points from the input space into a high-dimensional feature space, where they can be separated by a linear hyper-plane. As a result, the choice of kernel function has a high impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft margin parameter C is another critical factor in the performance of the SVM classifier. Hence, one of the critical problems of the conventional SVM classification framework is the need to determine an appropriate kernel function and an appropriate value of C for datasets of varying characteristics in order to ensure high classifier accuracy. In this paper, we introduce a distance measurement technique that uses the Euclidean distance function to replace the optimal separating hyper-plane as the classification decision function in the SVM. In our approach, the support vectors for each category are identified from the training data points during the training phase using the SVM. In the classification phase, when a new data point is mapped into the original vector space, the average distances between the new data point and the support vectors of each category are measured using the Euclidean distance function. The classification decision is based on the category of support vectors that has the lowest average distance to the new data point, which makes the decision independent of the efficacy of the hyper-plane formed by a particular kernel function and soft margin parameter. We tested the proposed framework on several text datasets. The experimental results show that, with this approach, the choice of kernel function and soft margin parameter C has a low impact on the accuracy of the Euclidean-SVM text classifier.
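The classification phase described in the abstract can be illustrated with a minimal sketch. It assumes the support vectors for each category have already been obtained from a trained SVM; the function name `euclidean_svm_predict`, the category labels, and the two-dimensional toy vectors are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean


def euclidean_svm_predict(x, support_vectors_by_category):
    """Pick the category whose support vectors are, on average, closest to x.

    x: a feature vector in the original input space.
    support_vectors_by_category: dict mapping a category label to the list of
        support vectors that a trained SVM identified for that category
        (assumed to be given; SVM training is not shown here).
    Returns the predicted label and the per-category average distances.
    """
    avg_dist = {
        category: mean(math.dist(x, sv) for sv in svs)
        for category, svs in support_vectors_by_category.items()
    }
    # The decision rule: lowest average Euclidean distance wins.
    return min(avg_dist, key=avg_dist.get), avg_dist


# Hypothetical toy support vectors, for illustration only.
support_vectors = {
    "sports": [(0.0, 1.0), (0.0, 3.0)],
    "finance": [(4.0, 0.0), (6.0, 0.0)],
}

label, distances = euclidean_svm_predict((0.0, 2.0), support_vectors)
print(label, distances)
```

In this sketch the kernel function and C would influence only which training points become support vectors; the decision itself uses plain Euclidean distances in the input space, which is the property the abstract attributes to the framework.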
Pages: 80-99
Number of pages: 20