PU text classification enhanced by term frequency-inverse document frequency-improved weighting

Cited by: 32
Authors
Peng, Tao [1, 2]
Liu, Lu [1, 2]
Zuo, Wanli [1]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
Funding
National Natural Science Foundation of China
Keywords
TF-IDF; TFIPNDF; Classification; 1-DNFC; WVC;
DOI
10.1002/cpe.3040
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Term frequency-inverse document frequency (TF-IDF) is one of the most popular feature (also called term or word) weighting methods for describing documents in the vector space model and in text mining and information retrieval applications. It effectively reflects the importance of a term in a collection of documents in which all documents play the same role. However, TF-IDF does not account for differences in a term's IDF weighting when documents play different roles in the collection, such as the positive and negative training sets in text classification. To address this, this paper presents a novel TF-IDF-improved feature weighting approach, term frequency inverse positive-negative document frequency (TFIPNDF), which reflects the importance of a term in the positive and the negative training examples, respectively. We also build a weighted voting classifier (WVC) by iteratively applying the support vector machine (SVM) algorithm, and we implement one-class SVM and Positive Example Based Learning (PEBL) methods for comparison. During classification, an improved 1-DNF algorithm, called 1-DNFC, is adopted to identify more reliable negative documents from the unlabeled example set. The experimental results show that the TFIPNDF-based classifier outperforms the TF-IDF-based one, and that the WVC also outperforms the one-class SVM-based and PEBL-based classifiers. Copyright (c) 2013 John Wiley & Sons, Ltd.
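
The weighting idea in the abstract can be illustrated with a minimal Python sketch. It computes ordinary TF-IDF over a whole collection and, separately, one IDF value per training class. The abstract does not give the TFIPNDF formula, so this is not the authors' method: the function names (`idf`, `tf_idf`, `class_idfs`), the smoothed IDF variant, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf(term, docs):
    """Smoothed IDF of `term` over tokenized `docs` (assumed variant)."""
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def tf_idf(doc, docs):
    """Ordinary TF-IDF: every document in `docs` plays the same role."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf(t, docs) for t, c in counts.items()}

def class_idfs(term, positive_docs, negative_docs):
    """Hypothetical helper: one IDF per training class, kept separate."""
    return idf(term, positive_docs), idf(term, negative_docs)

# Toy data (invented for this sketch).
positive = [["svm", "kernel", "margin"], ["svm", "text", "margin"]]
negative = [["football", "text", "score"], ["football", "goal", "score"]]
collection = positive + negative

weights = tf_idf(positive[0], collection)
idf_pos, idf_neg = class_idfs("svm", positive, negative)
```

Here "svm" occurs in every positive document and in no negative one, so its two class-specific IDF values differ sharply, while the single collection-wide IDF averages that difference away. How the two values should be combined into one weight is exactly what the paper defines and this sketch leaves open.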
Pages: 728-741
Number of pages: 14