Feature selection method on imbalanced text

Cited by: 1
Authors
Liao, Yi-Xing [1, 2]
Pan, Xue-Zeng [1]
Affiliations
[1] College of Computer Science and Technology, Zhejiang University
[2] Department of Information, Zhejiang University of Finance and Economics
Source
Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China | 2012 / Vol. 41 / No. 04
Keywords
Feature selection; Imbalanced dataset; Strong class-related; Text classification
DOI
10.3969/j.issn.1001-0548.2012.04.022
Abstract
After analyzing the four basic information elements used by traditional feature selection methods, a new measure of strong class information is introduced and a new feature selection method is proposed for imbalanced text classification. Strong class information is used to improve classification performance on minority classes, and term frequency is used to improve it on majority classes. Experiments on the Reuters-21578 dataset show that the proposed method outperforms IG (information gain) and CHI (chi-square), improving both Micro-F1 and Macro-F1.
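The following is a minimal Python sketch of the CHI baseline and the two F1 averages named in the abstract, not the authors' proposed method. It assumes that the "four basic information elements" are the usual 2x2 document counts for a term and a class; the record itself does not spell them out. The names `chi_square`, `f1`, and `micro_macro_f1` and all counts are hypothetical, chosen only for illustration.

```python
def chi_square(a, b, c, d):
    """CHI score of a term for a class from a 2x2 table of document counts.

    Assumed meaning of the four counts (not stated in the record):
    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def micro_macro_f1(per_class):
    """Return (micro_f1, macro_f1) for a list of per-class (tp, fp, fn)."""
    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp, fp, fn) for tp, fp, fn in per_class) / len(per_class)
    # Micro: pool the counts first, so large classes dominate the score.
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    return f1(tp, fp, fn), macro


# Hypothetical counts: one majority class and one minority class.
micro, macro = micro_macro_f1([(90, 10, 10), (1, 4, 4)])
print(f"micro-F1={micro:.3f}, macro-F1={macro:.3f}")
```

In this made-up example the weak minority class pulls Macro-F1 well below Micro-F1, which is why results on imbalanced text are usually reported with both averages.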
Pages: 592-595
Number of pages: 3