On a New Model for Automatic Text Categorization Based on Vector Space Model

Cited by: 0

Authors:

Suzuki, Makoto [1]

Yamagishi, Naohide [1]

Ishida, Takashi [2]

Goto, Masayuki [2]

Hirasawa, Shigeichi [3]

Affiliations:

[1] Shonan Inst Technol, Fac Informat Sci, 1-1-25 Tsujido Nishikaigan, Kanagawa 2518511, Japan

[2] Waseda Univ, Shinjuku Ku, Tokyo 169, Japan

[3] Cyber Univ, Shinjuku Ku, Tokyo 162, Japan

Source:

IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2010) | 2010

Keywords:

text mining; classification; N-gram; newspaper

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). It is a simple technique that adds up the ratios of term frequencies among categories, and it can use index terms without limit. We then improved FRAM by adopting character N-grams to form index terms. However, FRAM lacked a satisfactory mathematical basis. We therefore present a new mathematical model based on the Vector Space Model and consider its implications. We evaluate the proposed method in several experiments that classify newspaper articles from the English Reuters-21578 data set and the Japanese CD-Mainichi 2002 data set. The Reuters-21578 data set is a benchmark data set for automatic text categorization. The results show that FRAM has good classification accuracy: the micro-averaged F-measure of the proposed method is 92.2% for English. The proposed method is language-independent and can perform classification with a single program.
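The abstract describes FRAM only in one sentence: it adds up the ratios of term frequencies among categories, using character N-grams as index terms. The following Python sketch illustrates that idea and nothing more. The exact ratio definition (a term's frequency in one category divided by its frequency across all categories), the choice of N = 3, and the names `char_ngrams`, `train`, and `classify` are assumptions made for illustration; they are not taken from the paper, and the vector-space formulation the paper proposes is not reproduced here.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text` (assumed index terms)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-term frequency ratios from (text, category) pairs.

    Assumed definition: ratio[term][category] is the term's frequency in
    that category divided by its total frequency over all categories.
    """
    counts = defaultdict(Counter)  # term -> Counter(category -> frequency)
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            counts[term][category] += 1

    ratios = {}
    for term, per_category in counts.items():
        total = sum(per_category.values())
        ratios[term] = {c: f / total for c, f in per_category.items()}
    return ratios


def classify(text, ratios, n=3):
    """Accumulate the ratios of every known term and return the top category.

    Returns None when no term of `text` was seen during training.
    """
    scores = Counter()
    for term in char_ngrams(text, n):
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    return scores.most_common(1)[0][0] if scores else None


if __name__ == "__main__":
    # Toy training data, invented for this sketch only.
    model = train([
        ("wheat corn harvest", "grain"),
        ("crude oil barrel", "energy"),
    ])
    print(classify("corn harvest report", model))
```

Because the index terms are plain character sequences, the same sketch runs unchanged on text in any language, which is consistent with the language independence the abstract claims. No term selection step is applied, so every N-gram seen in training contributes to the score.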

Citation

Pages: 3152-3159

Page count: 8
