Research on Topic Recognition of Network Sensitive Information Based on SW-LDA Model

Cited by: 14
Authors
Xu, Guixian [1]
Wu, Xu [1]
Yao, Haishen [1]
Li, Fan [1]
Yu, Ziheng [1]
Affiliations
[1] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
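Source
IEEE Access | 2019 / Vol. 7
Keywords
Sensitive information; topic recognition; word embedding; artificial intelligence
DOI
10.1109/ACCESS.2019.2897475
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract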
Mining network sensitive information is of great significance for understanding the social stability of the network. Tracking public opinion around sensitive information helps in gauging Internet users' attitudes toward important social events. Artificial intelligence techniques can extract topics from network texts. However, current topic recognition models have a low recognition rate for sensitive information and often generate inaccurate topic keywords. This paper proposes a topic recognition method for network sensitive information based on a sensitive word weighted latent Dirichlet allocation (SW-LDA) model. First, a basic sensitive word vocabulary is constructed by manual collection, and word embeddings are obtained by training Word2vec on a large network corpus. The semantic similarity between word embeddings is then calculated to extend the basic sensitive word vocabulary. Second, the extended sensitive word vocabulary is embedded in the LDA model. On the one hand, this improves the semantic understanding and recognition ability of LDA for sensitive topic words and raises the quality of the generated topic words. On the other hand, it improves the relevance between topic keywords and their topics and finds more fine-grained keywords. The experimental results show that the sensitive word weighted LDA model effectively improves both the quantity and the quality of the sensitive-information topics recognized. This work contributes to the development of artificial intelligence, and the corpus it generates is useful for research on text classification, clustering, information retrieval, and related tasks.
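The vocabulary-extension step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embeddings, the similarity threshold, the `boost` value, and the function names `extend_vocabulary` and `sensitive_weights` are all assumptions made for illustration, and the abstract does not say how the weights enter the LDA model. In practice the vectors would come from a trained Word2vec model rather than a hand-written dictionary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_vocabulary(seed_words, embeddings, threshold=0.9):
    """Return the seed words plus every word whose embedding is at least
    `threshold`-similar to some seed word (threshold is an assumed value)."""
    extended = set(seed_words)
    seeds_with_vectors = [s for s in seed_words if s in embeddings]
    for word, vec in embeddings.items():
        if word in extended:
            continue
        if any(cosine(vec, embeddings[s]) >= threshold for s in seeds_with_vectors):
            extended.add(word)
    return extended


def sensitive_weights(vocab, sensitive, boost=2.0):
    """Hypothetical per-term weights: sensitive words get `boost`, others 1.0."""
    return {w: (boost if w in sensitive else 1.0) for w in vocab}


# Toy 3-dimensional embeddings standing in for Word2vec output (assumed data).
embeddings = {
    "riot": [0.9, 0.1, 0.0],
    "unrest": [0.85, 0.2, 0.05],
    "protest": [0.8, 0.3, 0.1],
    "weather": [0.0, 0.1, 0.95],
    "recipe": [0.1, 0.9, 0.2],
}

extended = extend_vocabulary(["riot"], embeddings)
weights = sensitive_weights(embeddings.keys(), extended)
print(sorted(extended))
print(weights)
```

With these toy vectors, the seed word "riot" pulls in "unrest" and "protest" while "weather" and "recipe" stay outside the extended vocabulary.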
Pages: 21527-21538
Number of pages: 12