On a New Model for Automatic Text Categorization Based on Vector Space Model

Cited by: 0
Authors
Suzuki, Makoto [1 ]
Yamagishi, Naohide [1 ]
Ishida, Takashi [2 ]
Goto, Masayuki [2 ]
Hirasawa, Shigeichi [3 ]
Affiliations
[1] Shonan Inst Technol, Fac Informat Sci, 1-1-25 Tsujido Nishikaigan, Kanagawa 2518511, Japan
[2] Waseda Univ, Shinjuku Ku, Tokyo 169, Japan
[3] Cyber Univ, Shinjuku Ku, Tokyo 162, Japan
Source
IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2010) | 2010
Keywords
text mining; classification; N-gram; newspaper;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). It is a simple technique that adds up the ratios of term frequencies among categories, and it can use index terms without limit. We then adopted character N-grams to form index terms, thereby improving FRAM. However, FRAM lacked a satisfactory mathematical basis. We therefore present here a new mathematical model based on the Vector Space Model and consider its implications. The proposed method is evaluated in several experiments in which we classify newspaper articles from the English Reuters-21578 data set and the Japanese CD-Mainichi 2002 data set. The Reuters-21578 data set is a benchmark data set for automatic text categorization. The results show that FRAM has good classification accuracy: the micro-averaged F-measure of the proposed method is 92.2% for English. The proposed method performs classification with a single program and is language-independent.
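The abstract describes FRAM only in outline, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that the "ratio" of a term for a category is that term's frequency in the category divided by its total frequency over all categories, and that a document is assigned to the category with the largest accumulated ratio. The function names (`char_ngrams`, `train`, `classify`) and the default `n=3` are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text (index terms)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-term frequency ratios from (text, category) pairs.

    Assumption: ratio(term, category) is the count of the term in that
    category divided by the count of the term over all categories.
    """
    freq = defaultdict(Counter)  # freq[term][category] = count
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            freq[term][category] += 1

    ratios = {}
    for term, by_category in freq.items():
        total = sum(by_category.values())
        ratios[term] = {c: k / total for c, k in by_category.items()}
    return ratios


def classify(text, ratios, n=3):
    """Accumulate the ratios of every known term and pick the top category.

    Returns None when no term of the text was seen during training.
    """
    scores = Counter()
    for term in char_ngrams(text, n):
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
training_docs = [
    ("wheat corn harvest", "grain"),
    ("crude oil barrel", "energy"),
]
model = train(training_docs)
prediction = classify("corn harvest rises", model)
```

Because the index terms are raw character N-grams, the sketch needs no tokenizer or stop-word list, which is consistent with the abstract's claim that a single program can handle both English and Japanese text.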
Pages: 3152-3159
Number of pages: 8