Abstract feature extraction for text classification

Cited by: 18
Authors
Biricik, Goksel [1]
Diri, Banu [1]
Sonmez, Ahmet Coskun [1]
Affiliations
[1] Yildiz Tech Univ, Dept Comp Engn, Istanbul, Turkey
Keywords
Dimensionality reduction; feature extraction; preprocessing for classification; probabilistic abstract features
DOI
10.3906/elk-1102-1015
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Feature selection and extraction are frequently used to overcome the curse of dimensionality in text classification problems. We introduce an extraction method that summarizes the features of document samples: each new feature aggregates the evidence a document provides for one class. To form these abstract features, we project the high-dimensional document features onto a new feature space whose dimensionality equals the number of classes. We test our method with 7 text classification algorithms that follow different classifier design approaches. We evaluate the classifiers on standard text categorization test collections and show the improvements achieved by applying our extraction method. We compare the classification performance of our method with popular, well-known feature selection and feature extraction schemes. The results show that our summarizing abstract feature extraction method improves classification performance for most of the classifiers when compared with the other methods.
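For illustration only, the sketch below shows one way such class-sized abstract features could be computed for bag-of-words document vectors: per-term class evidence is estimated from training counts and each document is projected onto one feature per class. The weighting scheme, the normalization, and the helper names (class_term_weights, abstract_features) are assumptions made for this example, not the paper's exact formulation.

```python
# Minimal sketch (assumed formulation, not the authors' exact method):
# project term-frequency document vectors onto C abstract features,
# one per class, each aggregating the document's evidence for that class.
import numpy as np

def class_term_weights(X_train, y_train, n_classes, smoothing=1.0):
    """Estimate a P(class | term)-style weight matrix of shape
    (n_classes, n_terms) from a term-frequency matrix X_train
    (n_docs x n_terms) and integer labels y_train. Hypothetical helper."""
    n_terms = X_train.shape[1]
    counts = np.full((n_classes, n_terms), smoothing)
    for c in range(n_classes):
        counts[c] += X_train[y_train == c].sum(axis=0)
    # Normalize each term's counts over classes so columns sum to 1.
    return counts / counts.sum(axis=0, keepdims=True)

def abstract_features(X, weights):
    """Map each document (row of X) to a C-dimensional abstract-feature
    vector by summing per-term class weights over its term frequencies."""
    Z = X @ weights.T                      # (n_docs x n_classes)
    row_sums = Z.sum(axis=1, keepdims=True)
    return Z / np.where(row_sums == 0, 1.0, row_sums)  # per-document normalization

# Usage (X_train, X_test: term-frequency matrices; y_train: labels 0..C-1):
# W = class_term_weights(X_train, y_train, n_classes=C)
# Z_train, Z_test = abstract_features(X_train, W), abstract_features(X_test, W)
# Z_train and Z_test can then be fed to any of the evaluated classifiers.
```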
Pages: 1137-1159
Number of pages: 23