Research on Tibetan hot words, sensitive words tracking and public opinion classification

Cited by: 0
Authors
Guixian Xu
Changzhi Wang
Haishen Yao
Qi Qi
Affiliation
[1] Minzu University of China, Information Engineering College
Source
Cluster Computing | 2019 / Vol. 22
Keywords
Web crawler; Tibetan hot words; Term weight computing; Sensitive words discovery; Text classification;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
The rapid development of Tibetan information technology provides rich resources for Tibetan information processing. Corpus construction is basic work in this field. In this paper, we design a system for collecting Tibetan web data and preprocessing web pages. The system accesses web resources in a timely and efficient manner and provides a basis for further analysis of Tibetan data. It can be used to build Tibetan corpora and enrich Tibetan digital resources, which alleviates the sparsity of Tibetan corpus data and the shortage of resources and makes Tibetan information processing more convenient. Hot words reflect the topics that attract Tibetan people's attention in a given period. First, the paper proposes a method for reducing the dimensionality of the feature space of Tibetan news text, which effectively reduces the complexity of subsequent processing. Second, it proposes a term weighting method based on an improved TF-IDF for Tibetan text information extraction; the method assigns different weights to words according to their location in the text in order to extract hot words. For sensitive word discovery and public opinion classification, a sensitive-word thesaurus is compiled manually, and sensitive words are extracted by matching text against it. Public opinion classification is based on the proposed classification formula and a public opinion thesaurus, and assigns each Tibetan text to one public opinion class. We develop software that automatically collects Tibetan web pages, preprocesses them, extracts text features and hot words, discovers sensitive words, and classifies each Tibetan text into one public opinion class. Experiments show that the Tibetan hot word extraction is effective and that the public opinion classification results are significant.
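The idea of weighting words by their location can be illustrated with a minimal Python sketch. This is not the paper's formula, which the abstract does not state. The position names (`title`, `body`), the weight values, the smoothed IDF variant, and all function names are assumptions made for illustration, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not give the actual values.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_tf(doc):
    """Term frequency where each occurrence counts by its position weight.

    `doc` maps a position name (e.g. "title") to a list of tokens.
    """
    counts = Counter()
    for position, tokens in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            counts[token] += weight
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def idf(docs):
    """Smoothed inverse document frequency (an assumed variant)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document, whatever its position.
        df.update({t for tokens in doc.values() for t in tokens})
    return {t: math.log((n + 1) / (d + 1)) + 1.0 for t, d in df.items()}


def hot_words(docs, top_k=5):
    """Rank terms by position-weighted TF-IDF summed over all documents."""
    idf_scores = idf(docs)
    totals = Counter()
    for doc in docs:
        for term, tf in weighted_tf(doc).items():
            totals[term] += tf * idf_scores[term]
    return [term for term, _ in totals.most_common(top_k)]
```

With these assumed weights, a word that appears in a title contributes three times as much to its term frequency as the same word in the body, so terms that recur in headlines across the collection rise to the top of the ranking.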
Pages: 9977-9990
Page count: 13