An approach for classification of highly imbalanced data using weighting and undersampling

Cited by: 0
Authors
Ashish Anand
Ganesan Pugalenthi
Gary B. Fogel
P. N. Suganthan
Affiliations
[1] Nanyang Technological University, School of Electrical and Electronic Engineering
[2] Natural Selection, Inc.
Source
Amino Acids | 2010 / Vol. 39
Keywords
Imbalanced datasets; SVM; Undersampling technique
DOI
Not available
Abstract
Real-world datasets are commonly imbalanced. Several approaches exist for handling such data, including weighting, sub-sampling, and data modeling. Learning in the presence of class imbalance remains a major challenge for machine learning. Techniques such as support-vector machines perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real imbalanced biological datasets, in which the ratio of negative to positive samples varies from ~9:1 to ~100:1. Useful classifiers have both high sensitivity and high specificity. Our results demonstrate that the proposed selection technique improves sensitivity compared with a weighted support-vector machine and with results available in the literature for the same datasets.
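The ideas named in the abstract (class weighting, majority-class undersampling, and evaluation by sensitivity and specificity) can be illustrated with a minimal standard-library sketch. It is not the authors' method: the abstract does not describe their selection rule, so plain random undersampling is used here as a stand-in. The inverse-frequency weighting formula is one common convention, not one stated by the paper. All function names and parameters are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count).

    Assumption: this is a common heuristic, not the paper's weighting.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_undersample(samples, labels, majority=0, ratio=1.0, seed=0):
    """Keep every non-majority sample plus a random subset of the majority class.

    `ratio` is the desired majority:minority size after sampling.
    Random selection is a stand-in for the paper's (unspecified) selection rule.
    """
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == majority]
    minor = [i for i, y in enumerate(labels) if y != majority]
    keep_n = min(len(major), int(round(ratio * len(minor))))
    kept = sorted(minor + rng.sample(major, keep_n))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity); assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data at roughly the 9:1 negative-to-positive ratio quoted in the abstract.
labels = [1] * 10 + [0] * 90
samples = list(range(len(labels)))

weights = balanced_class_weights(labels)
sub_samples, sub_labels = random_undersample(samples, labels, majority=0, ratio=1.0)

# A classifier that always predicts the majority class has perfect specificity
# but zero sensitivity, which is why both measures are reported.
sens, spec = sensitivity_specificity(labels, [0] * len(labels))
```

On this toy data, the minority class receives a weight nine times that of the majority class, and undersampling at `ratio=1.0` leaves ten samples of each class.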
Pages: 1385-1391
Page count: 6