The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization

Cited by: 0
Authors
Hu, Yan [1]
Wu, Wei [1]
Miao, Miao [1]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Peoples R China
Source
IEEC 2009: FIRST INTERNATIONAL SYMPOSIUM ON INFORMATION ENGINEERING AND ELECTRONIC COMMERCE, PROCEEDINGS | 2009
Keywords
Automatic Construction; Large-scale Corpus; Chinese Text Categorization;
DOI
10.1109/IEEC.2009.141
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Large-scale corpora contain abundant language phenomena. They reflect the general laws of language use and have drawn interest from the information technology and linguistics communities in many countries, making them a hot topic in natural language processing. However, Chinese corpora are currently scarce, and Chinese corpora for text categorization are especially rare. Text categorization has become the core and foundation of large-scale data processing applications, and the lag in corpus research has become an obstacle to the development of information technology. Therefore, by analyzing the characteristics of Chinese categorization corpora, drawing on the Internet, currently the largest knowledge base, and relying on the search capability of search engines, this paper proposes and implements an algorithm for constructing a large-scale corpus for Chinese text categorization. Experiments show that the corpus constructed by this algorithm performs well with various classifiers and has some practical value.
Pages: 640-645
Number of pages: 6