Micro-blog Commercial Word Extraction Based On Improved TF-IDF Algorithm

Cited by: 0
Authors
Huang, Xing [1]
Wu, Qing [1]
Affiliation
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou 310018, Zhejiang, Peoples R China
Source
2013 IEEE INTERNATIONAL CONFERENCE OF IEEE REGION 10 (TENCON) | 2013
Keywords
Micro-blog; Commercial Word Extract; TF-IDF; Hadoop; Mass Data;
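DOI
Not available
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808; 0809
Abstract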
Some existing micro-blog commercial word extraction algorithms consider only the relationship between keywords and the number of times they appear in texts, and ignore how the keywords are distributed within a given category, which reduces the accuracy of micro-blog commercial word extraction. To address this problem, this paper studies the application of the TF-IDF algorithm to term weight calculation. Drawing on information theory and analyzing the distribution of keywords within a class, the paper proposes an improved TF-IDF algorithm and applies it to term weight calculation. To test the feasibility of the improved algorithm, the massive micro-blog data were first classified into categories, and the improved TF-IDF algorithm was then used to calculate term weights among the categories; this calculation was implemented on the Hadoop distributed framework. The experimental results show that the improved TF-IDF algorithm is effective and feasible for micro-blog commercial word extraction. Compared with traditional algorithms, the improved algorithm greatly improves accuracy. In addition, data processing speed improves greatly under the Hadoop framework.
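For orientation, the classic TF-IDF weighting that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration only: the abstract does not give the formula of the improved algorithm, so the within-class distribution term and the Hadoop implementation are not reproduced here. The function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF weights for a list of tokenized documents.

    Baseline only; the paper's improved, class-aware variant is not
    specified in the abstract and is not implemented here.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Weight = (relative term frequency) * log(N / document frequency).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized micro-blog posts, for illustration only.
docs = [
    ["discount", "phone", "sale"],
    ["phone", "review"],
    ["weather", "today"],
]
weights = tf_idf(docs)
```

In this toy data, "discount" occurs in one post and "phone" in two, so "discount" receives the higher weight in the first post. A term that occurs in every post receives a weight of zero regardless of how it is spread across categories, which is the kind of limitation the abstract attributes to existing approaches.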
Pages: 5