Automatic Text Categorization by a Granular Computing Approach: facing Unbalanced Data Sets

Cited by: 0
Authors
Possemato, Francesca [1]
Rizzi, Antonello [1]
Affiliations
[1] Univ Rome, SAPIENZA, Dept Informat Engn Elect & Telecommun, I-00184 Rome, Italy
Source
2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2013
Keywords
Text categorization; Granular computing; Frequent substructures mining; Unbalanced data sets;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Text categorization is a machine learning task with a wide range of applications, from document management systems to web mining. Designing such a system requires both a suitable preprocessing procedure and an effective document representation that reflects the semantic nature of the document categories as closely as possible. To this aim, relying on a Granular Computing approach and treating a document as an ordered sequence of words, we propose a system able to automatically mine frequent terms, where a term is either a single word or a subsequence of (a few) consecutive words. The whole classification system is tailored to process sequences of atomic elements (i.e., encoded words) by means of an embedding procedure based on clustering methods. However, when dealing with unbalanced data sets, i.e., when classes are not evenly represented in the data set, the frequent substructures search procedure must be carefully designed. We demonstrate the effectiveness of the system on a well-known benchmarking data set, achieving competitive test set classification accuracy with a remarkably low structural complexity of the synthesized classification models.
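To make the idea of frequent terms on unbalanced data more concrete, the following is a minimal sketch, not the authors' procedure. It assumes that a term is a run of up to `max_len` consecutive words and that a term counts as frequent when its document frequency inside at least one class reaches `min_support`. The function names, the parameters, and the per-class support rule are all illustrative assumptions; the abstract states only that the search must be designed with care when classes are unevenly represented.

```python
from collections import Counter


def ngrams(words, max_len):
    """Yield every run of 1..max_len consecutive words as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def mine_frequent_terms(docs, labels, max_len=3, min_support=0.5):
    """Return the terms frequent in at least one class.

    Support is measured relative to the size of each class (an assumption
    made for this sketch), so a small class is not drowned out by a large one.
    """
    class_sizes = Counter(labels)
    doc_freq = {c: Counter() for c in class_sizes}
    for words, label in zip(docs, labels):
        # set(...) counts each term at most once per document
        doc_freq[label].update(set(ngrams(words, max_len)))

    frequent = set()
    for c, counts in doc_freq.items():
        for term, count in counts.items():
            if count / class_sizes[c] >= min_support:
                frequent.add(term)
    return frequent


# Toy unbalanced corpus (hypothetical data): four "crude" documents, one "grain".
docs = [
    ["oil", "price", "rises"],
    ["oil", "price", "falls"],
    ["oil", "price", "stable"],
    ["oil", "price", "up"],
    ["wheat", "harvest"],
]
labels = ["crude", "crude", "crude", "crude", "grain"]
terms = mine_frequent_terms(docs, labels, max_len=3, min_support=0.6)
```

In this toy corpus a single threshold applied to the whole collection would discard `("wheat", "harvest")`, which occurs in only one document out of five, whereas the per-class rule keeps it because it covers the entire minority class.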
Pages: 8