Filtering Chinese microblog topics noise algorithm based on a semi-supervised model

Cited by: 0
Authors
Tu S. [1]
Yang J. [2]
Zhao L. [3]
Zhu X. [1]
Affiliations
[1] Department of Computer Science and Technology, Tsinghua University, Beijing
[2] CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
[3] State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing
Source
Qinghua Daxue Xuebao/Journal of Tsinghua University | 2019, Vol. 59, No. 3
Keywords
K-nearest neighbor; Noise filtering; Penalty cost; Social networks; Support vector machine
DOI
10.16511/j.cnki.qhdxxb.2019.26.060
Abstract
Social network feeds often contain a large amount of spam, such as marketing posts, recruitment posts, and short articles without real content, which erodes user interest. The spam also seriously affects academic research and business applications. This paper presents an algorithm based on the pSVM-kNN model for filtering noise from Chinese microblog text. The method combines the SVM and kNN algorithms: the kNN algorithm iteratively searches for the optimal classification boundary in the local region around the hyperplane computed by the SVM. Penalty costs and proportional weights are introduced into the SVM and kNN stages to improve noise filtering and reduce misclassification. Tests on real Sina Weibo datasets of various sizes show that the precision and recall of this algorithm are significantly better than those of other methods, with a marked improvement in the F-measure. © 2019, Tsinghua University Press. All rights reserved.
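The general idea of pairing an SVM with a kNN vote near the decision boundary can be illustrated with a minimal sketch. This is not the paper's pSVM-kNN algorithm: the abstract does not give the formulas, so the class-dependent hinge penalty, the margin band, the per-class vote weights, the single (non-iterative) kNN pass, and every name below (`train_linear_svm`, `knn_vote`, `classify`, `cost`, `band`, `weight`) are assumptions made for illustration. The 2-D points stand in for real text features, with label 1 for spam and -1 for normal posts.

```python
import random

def train_linear_svm(X, y, cost, lam=0.01, lr=0.05, epochs=200, seed=0):
    """Subgradient descent on a hinge loss whose penalty depends on the class.

    cost maps each label (1 or -1) to its misclassification penalty (assumed form).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]  # L2 shrinkage
            if margin < 1.0:  # sample is inside the margin or misclassified
                step = lr * cost[y[i]] * y[i]
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
                b += step
    return w, b

def decision(w, b, x):
    """Signed score of x relative to the SVM hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def knn_vote(X, y, x, k=3, weight=None):
    """Vote among the k nearest labeled points, scaled by a per-class weight."""
    weight = weight or {1: 1.0, -1: 1.0}
    def sq_dist(i):
        return sum((a - c) ** 2 for a, c in zip(X[i], x))
    nearest = sorted(range(len(X)), key=sq_dist)[:k]
    score = sum(weight[y[i]] * y[i] for i in nearest)
    return 1 if score > 0 else -1  # ties fall back to "normal"

def classify(w, b, X, y, x, band=1.0, k=3, weight=None):
    """Trust the SVM far from the hyperplane; defer to kNN inside the band."""
    f = decision(w, b, x)
    if abs(f) >= band:
        return 1 if f > 0 else -1
    return knn_vote(X, y, x, k, weight)

# Toy data: hypothetical 2-D feature vectors, not real Weibo features.
X_train = [(2.0, 2.0), (2.0, 3.0), (3.0, 2.0), (3.0, 3.0), (1.5, 2.5),
           (-2.0, -2.0), (-2.0, -3.0), (-3.0, -2.0), (-3.0, -3.0), (-1.5, -2.5)]
y_train = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

# Assumed penalties: errors on normal posts (-1) cost twice as much.
w, b = train_linear_svm(X_train, y_train, cost={1: 1.0, -1: 2.0})
label = classify(w, b, X_train, y_train, (0.5, 0.5))
```

In this sketch the band width controls how much of the decision is handed to the local kNN vote, and the vote weights let one class count for more than the other, which is one plausible reading of the "proportional weights" mentioned in the abstract.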
Pages: 178-185
Number of pages: 7