A Study of Chinese Document Representation and Classification with Word2vec

Cited by: 0
Authors
Zhu, Lei [1]
Wang, Guijun [1]
Zou, Xiancun [1]
Affiliation
[1] Southwest Univ, Sch Comp & Informat Sci, Chongqing 400715, Peoples R China
Source
PROCEEDINGS OF 2016 9TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID), VOL 1 | 2016
Keywords
tf-idf; word2vec; text classification;
DOI
10.1109/ISCID.2016.74
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Word2vec is a neural network language model that maps words and phrases to high-quality distributed vectors (word embeddings) that capture semantic relationships between words, so it offers a distinct perspective on text classification and other natural language processing (NLP) tasks. In this paper, we propose combining an improved tf-idf algorithm with word embeddings to represent documents, and we conduct text classification experiments on the Sogou Chinese classification corpus. Our results show that the combination of word embeddings and the improved tf-idf algorithm outperforms either method used alone.
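One common way to combine tf-idf with word embeddings is to represent a document as the tf-idf-weighted average of its word vectors. The sketch below illustrates that idea only. It uses standard tf-idf rather than the improved variant the abstract mentions (which is not specified here), and the function names, the toy two-dimensional embeddings, and the pre-tokenized toy documents are all assumptions made for illustration. In practice the vectors would come from a trained word2vec model and the tokens from a Chinese word segmenter.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Standard tf-idf (relative term frequency times log(N / df)),
    NOT the improved variant described in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def document_vector(doc_weights, embeddings, dim):
    """tf-idf-weighted average of word vectors.

    Terms without an embedding, or with zero weight, are skipped.
    Returns the zero vector if nothing contributes.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for term, w in doc_weights.items():
        vec = embeddings.get(term)
        if vec is None or w == 0.0:
            continue
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Hypothetical toy data: three pre-segmented documents and hand-made
# 2-d "embeddings" (finance words on axis 0, sports words on axis 1).
docs = [
    ["股票", "市场", "上涨"],
    ["球队", "比赛", "胜利"],
    ["股票", "比赛"],
]
embeddings = {
    "股票": [1.0, 0.0], "市场": [1.0, 0.0], "上涨": [1.0, 0.0],
    "球队": [0.0, 1.0], "比赛": [0.0, 1.0], "胜利": [0.0, 1.0],
}

weights = tfidf_weights(docs)
doc_vectors = [document_vector(w, embeddings, dim=2) for w in weights]
```

The resulting `doc_vectors` could then be passed to any standard classifier. Weighting by tf-idf lets terms that are distinctive for a document contribute more to its vector than terms that appear across the whole corpus.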
Pages: 298-302
Page count: 5