Text categorization using distributional clustering and concept extraction

Cited by: 0
Authors
He, Yifan [1]
Jiang, Minghu [1]
Affiliation
[1] Tsinghua Univ, Sch Human & Social Sci, Lab Computat Linguist, Beijing 100084, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS: WITH ASPECTS OF THEORETICAL AND METHODOLOGICAL ISSUES | 2007 / Vol. 4681
Keywords
text categorization; feature selection; distributional clustering;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text categorization (TC) has become one of the most researched fields in NLP. In this paper, we try to solve the problem of TC through a two-step feature selection approach. First, we cluster the words that appear in the texts according to their distribution over categories. Then we extract concepts, which are DEF terms in HowNet, from these clusters. The extraction operates on word clusters rather than on single words. This method maintains the generalization ability of concept-extraction-based TC and at the same time makes full use of the occurrences of new words that are not found in the concept thesaurus. We test the performance of our feature selection method on the Sogou corpus for TC with an SVM classifier. Results of our experiments show that our method can improve the performance of TC in all categories.
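
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy clustering, the Jensen-Shannon threshold, the majority-vote concept labeling, the toy documents, and the `thesaurus` dictionary (a stand-in for HowNet DEF terms, with made-up concept labels) are all assumptions chosen for illustration. The point it shows is how a word missing from the thesaurus can still receive a concept feature through its cluster.

```python
import math
from collections import Counter, defaultdict


def category_distributions(docs):
    """Map each word to its empirical distribution P(category | word)."""
    counts = defaultdict(Counter)
    for category, tokens in docs:
        for tok in tokens:
            counts[tok][category] += 1
    return {
        word: {c: n / sum(cnt.values()) for c, n in cnt.items()}
        for word, cnt in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def cluster_words(dists, threshold=0.1):
    """Step 1 (assumed greedy variant): a word joins the first cluster whose
    seed distribution is within `threshold`, otherwise it starts a new one."""
    clusters = []  # list of (seed distribution, member words)
    for word in sorted(dists):
        for seed, members in clusters:
            if js_divergence(dists[word], seed) <= threshold:
                members.append(word)
                break
        else:
            clusters.append((dists[word], [word]))
    return [members for _, members in clusters]


def cluster_concepts(clusters, thesaurus):
    """Step 2 (assumed majority vote): label each cluster with the most common
    concept among its members that appear in the thesaurus."""
    features = {}
    for i, members in enumerate(clusters):
        known = Counter(thesaurus[w] for w in members if w in thesaurus)
        label = known.most_common(1)[0][0] if known else f"cluster_{i}"
        for w in members:
            features[w] = label
    return features


# Toy labeled corpus and hypothetical thesaurus (not real HowNet entries).
docs = [
    ("sports", ["match", "goal", "striker"]),
    ("sports", ["goal", "striker", "match"]),
    ("finance", ["stock", "bond", "hedgefund"]),
    ("finance", ["bond", "stock", "hedgefund"]),
]
thesaurus = {"match": "compete", "goal": "compete",
             "stock": "money", "bond": "money"}

clusters = cluster_words(category_distributions(docs))
features = cluster_concepts(clusters, thesaurus)
print(features)
```

In this toy run, `hedgefund` and `striker` are absent from the thesaurus but inherit the concept of the cluster they fall into, which is the behavior the abstract attributes to extracting concepts from clusters rather than from single words. The resulting concept features would then be fed to a classifier such as an SVM.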
Pages: 720 / +
Page count: 2