An approach for classification of highly imbalanced data using weighting and undersampling

Cited by: 0

Authors:

Ashish Anand

Ganesan Pugalenthi

Gary B. Fogel

P. N. Suganthan

Affiliations:

[1] Nanyang Technological University, School of Electrical and Electronic Engineering

[2] Natural Selection, Inc.

Source:

Amino Acids | 2010 / Vol. 39

Keywords:

Imbalanced datasets; SVM; Undersampling technique

DOI:

Not available

Abstract:

Real-world datasets are commonly imbalanced. Several approaches exist for handling such data, including weighting, sub-sampling, and data modeling. Learning in the presence of data imbalance presents a great challenge to machine learning. Techniques such as support-vector machines perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real, imbalanced biological datasets, in which the ratio of negative to positive samples varies from ~9:1 to ~100:1. Useful classifiers have both high sensitivity and high specificity. Our results demonstrate that the proposed selection technique improves sensitivity compared with a weighted support-vector machine and with results available in the literature for the same datasets.
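The abstract names three ingredients without giving details: class weighting, undersampling of the majority class, and evaluation by sensitivity and specificity. The following is a minimal sketch of those generic ideas only. It uses plain random undersampling and inverse-frequency weights as stand-ins; it is not the selection technique proposed in the paper, which the abstract does not describe. All function names, the 9:1 toy labels, and the `ratio` parameter are assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(X, y, majority_label=0, ratio=1.0, seed=0):
    """Keep every minority sample and a random subset of the majority class.

    `ratio` is the target majority:minority ratio after sampling.
    This is a generic baseline, not the paper's selection technique.
    """
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


def inverse_frequency_weights(y):
    """Per-class weights n / (k * n_c), so rarer classes weigh more."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def sensitivity_specificity(y_true, y_pred, positive_label=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    tn = sum(t != positive_label and p != positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy data with a 9:1 negative-to-positive ratio (hypothetical, for illustration).
y = [1] * 10 + [0] * 90
X = [[float(i)] for i in range(len(y))]

X_bal, y_bal = random_undersample(X, y, majority_label=0, ratio=1.0)
weights = inverse_frequency_weights(y)

# A classifier that always predicts "negative" is 90% accurate here,
# yet it detects no positives at all.
always_negative = [0] * len(y)
sens, spec = sensitivity_specificity(y, always_negative)
```

The always-negative predictor shows why the abstract stresses sensitivity together with specificity: on imbalanced data, accuracy alone can look good while every positive sample is missed. The weights dictionary has the shape that many SVM implementations accept as a per-class weight setting, which is one common way to build a weighted SVM baseline.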

Pages: 1385-1391

Page count: 6
