Improving Text Categorization with Semantic Knowledge in Wikipedia

Cited by: 11
Authors
Wang, Xiang [1]
Jia, Yan [1]
Chen, Ruhua [1]
Fan, Hua [1]
Zhou, Bin [1]
Affiliation
[1] Natl Univ Def Technol, Sch Comp, Changsha, Hunan, Peoples R China
Keywords
text categorization; Wikipedia; document representation; semantic matrix;
DOI
10.1587/transinf.E96.D.2786
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with the "Bag of Words (BOW)" text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method in place of the traditional BOW model for text classification. To overcome the weakness of ignoring the semantic relationships among terms in the document representation model and to utilize the rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich the Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.
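The idea of enriching a concept-based representation with a semantic matrix can be illustrated with a minimal sketch. The abstract does not say how the semantic matrix is built or applied, so everything below is an assumption made for illustration: the concept names, the document weights, the relatedness values, and the choice to enrich by multiplying the document-concept matrix by the concept-concept matrix.

```python
import numpy as np

# Hypothetical Wikipedia concepts (illustrative only, not from the paper).
concepts = ["Machine learning", "Support vector machine", "Football"]

# Assumed document-concept weights: one row per document, one column per concept.
docs = np.array([
    [1.0, 0.0, 0.0],  # document 0 mentions only "Machine learning"
    [0.0, 1.0, 0.0],  # document 1 mentions only "Support vector machine"
    [0.0, 0.0, 1.0],  # document 2 mentions only "Football"
])

# Assumed concept-concept relatedness values in [0, 1]; the diagonal is 1.
semantic = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])


def enrich(doc_concept, semantic_matrix):
    """Spread each document's concept weights onto related concepts."""
    return doc_concept @ semantic_matrix


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


enriched = enrich(docs, semantic)

# Before enrichment, documents 0 and 1 share no concept and look unrelated.
print(cosine(docs[0], docs[1]))
# After enrichment, their related concepts give them overlapping weights.
print(cosine(enriched[0], enriched[1]))
# Documents about unrelated concepts stay dissimilar.
print(cosine(enriched[0], enriched[2]))
```

In this toy setting, two short documents that share no concept become similar once related concepts contribute weight, which is one way such a matrix could ease the sparsity problem the abstract describes.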
Pages: 2786-2794
Number of pages: 9