Class-index corpus-index measure: A novel feature selection method for imbalanced text data

Cited by: 7
Author
Parlak, Bekir [1]
Affiliation
[1] Amasya Univ, Dept Comp Engn, Amasya, Turkey
Keywords
feature selection; imbalanced datasets; text classification; CLASSIFICATION; SCHEME;
DOI
10.1002/cpe.7140
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
In text classification, some datasets are unbalanced, and for such datasets the feature selection stage is important for improving performance. Many studies address this area; however, existing methods are based only on intra-class document frequency. This study proposes a new feature selection method for unbalanced text classification, the class-index corpus-index measure (CiCi), which considers the status of a feature in both the class and the corpus. CiCi is a probabilistic method calculated from the distribution of a feature in both the class and the corpus. Multinomial Naive Bayes and support vector machines were used as classifiers in the experiments, which were run on three unbalanced benchmark datasets: Reuters-21578, Ohsumed, and Enron1. The experimental results show that the proposed method outperforms successful methods from the literature on three different success measures.
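The abstract says the measure draws on a feature's distribution in both the class and the corpus, but it does not give the formula. The sketch below therefore does not implement CiCi; it only illustrates, on a made-up toy corpus, the two kinds of document-frequency probabilities such a measure could start from. The function name `doc_frequency_stats`, the toy documents, and the labels are all assumptions introduced for illustration.

```python
from collections import defaultdict


def doc_frequency_stats(docs, labels):
    """Return (per-class, corpus-level) document-frequency probabilities.

    docs:   list of token lists, one per document
    labels: list of class labels, same length as docs
    NOTE: illustrative only; this is not the CiCi formula from the paper.
    """
    n_docs = len(docs)
    class_sizes = defaultdict(int)                    # documents per class
    class_df = defaultdict(lambda: defaultdict(int))  # class -> term -> doc count
    corpus_df = defaultdict(int)                      # term -> doc count in corpus

    for tokens, label in zip(docs, labels):
        class_sizes[label] += 1
        for term in set(tokens):                      # count each term once per document
            class_df[label][term] += 1
            corpus_df[term] += 1

    # Share of all documents that contain the term
    p_term_corpus = {t: df / n_docs for t, df in corpus_df.items()}
    # Share of the class's documents that contain the term
    p_term_class = {
        c: {t: df / class_sizes[c] for t, df in terms.items()}
        for c, terms in class_df.items()
    }
    return p_term_class, p_term_corpus


# Hypothetical unbalanced toy corpus: three "crude" documents, one "grain" document
docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["oil", "market"],
    ["wheat", "export"],
]
labels = ["crude", "crude", "crude", "grain"]
p_class, p_corpus = doc_frequency_stats(docs, labels)
```

In this toy corpus, "oil" occurs in every "crude" document and never in the minority class, while "export" is split across both classes. A selection measure that looks at both the class-level and corpus-level figures can tell these two cases apart.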
Pages: 12