A method for Chinese text classification based on apparent semantics and latent aspects

Cited by: 25
Authors
Chen, Ye-Wang [1]
Wang, Jiong-Liang [1]
Cai, Yi-Qiao [1]
Du, Ji-Xiang [1]
Affiliations
[1] Huaqiao University (Xiamen), College of Computer Science and Technology, Xiamen, People's Republic of China
Funding
U.S. National Science Foundation;
Keywords
BaiduBaike; Apparent semantics; Latent aspects; Chinese text classification;
DOI
10.1007/s12652-015-0257-z
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Existing text classification methods fail to achieve high accuracy on Chinese texts because the basic unit of Chinese text is not the hanzi (character) but the phrase, and Chinese text has no natural delimiter to separate phrases. The problem is worse for large volumes of Chinese Web texts, which often lack sufficient context because most of them are short, irregular, and sparse. This paper proposes a new classification method for Chinese texts based on apparent semantics and latent aspects (ASLA). First, the apparent semantics of a Chinese text are extracted with BaiduBaike and used as features in place of hanzi. Second, pLSA is applied to mine the latent aspects of these apparent semantics. Third, the degree of relevance of a document to a category is calculated from the apparent semantics and latent aspects. Finally, the category of a document is determined by this degree of relevance. The proposed method handles short Chinese Web texts well with only a small amount of training data. Our experiments show that the method is promising and that it outperforms pLSA, SVM, KNN, and CRF when training data are insufficient and the text is irregular.
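The second step of the abstract relies on pLSA, which is usually fitted with the EM algorithm. The following is a minimal sketch of that step only, assuming a toy document-by-feature count matrix; the function name `fit_plsa`, the toy counts, and the smoothing constant `eps` are illustrative assumptions and are not taken from the paper. The BaiduBaike feature extraction and the relevance-degree formula are not shown because the abstract does not specify them.

```python
import random


def fit_plsa(counts, n_topics, n_iter=100, seed=0, eps=1e-12):
    """Fit pLSA with EM (illustrative sketch, not the paper's code).

    counts[d][w] is how often feature w occurs in document d.
    Returns (p_z_d, p_w_z): P(topic | doc) and P(feature | topic).
    """
    rng = random.Random(seed)
    n_docs, n_feats = len(counts), len(counts[0])

    def normalize(row):
        # eps keeps the division safe if a row is all zeros.
        total = sum(row) + eps * len(row)
        return [(v + eps) / total for v in row]

    # Random starting distributions.
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_feats)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_feats for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_feats):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(joint)
                for z in range(n_topics):
                    resp = n_dw * joint[z] / norm
                    acc_z_d[d][z] += resp
                    acc_w_z[z][w] += resp
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


# Hypothetical toy corpus: 4 documents over 4 extracted features.
toy_counts = [
    [4, 3, 0, 0],
    [5, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 2, 5],
]
p_z_d, p_w_z = fit_plsa(toy_counts, n_topics=2)
```

Each row of `p_z_d` is a document's distribution over latent aspects, and each row of `p_w_z` is an aspect's distribution over features. A classifier in the spirit of the abstract would combine such distributions with the apparent-semantics features to score each category, but that scoring rule is left open here.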
Pages: 473-480
Number of pages: 8