Analysis of sampling techniques for imbalanced data: An n=648 ADNI study

Cited by: 139
Authors
Dubey, Rashmi [1 ,2 ]
Zhou, Jiayu [1 ,2 ]
Wang, Yalin [1 ]
Thompson, Paul M. [3 ]
Ye, Jieping [1 ,2 ]
Affiliations
[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA
[2] Arizona State Univ, Biodesign Inst, Ctr Evolutionary Med & Informat, Tempe, AZ 85287 USA
[3] Univ Calif Los Angeles, Sch Med, Imaging Genet Ctr, Lab Neuro Imaging, Los Angeles, CA USA
Funding
Canadian Institutes of Health Research; National Science Foundation (USA); National Institutes of Health (USA);
Keywords
Alzheimer's disease; Classification; Imbalanced data; Undersampling; Oversampling; Feature selection; ALZHEIMERS-DISEASE; CLASSIFICATION; MRI; HIPPOCAMPAL; ASSOCIATION; PREDICTION; BIOMARKERS; SIGNATURE; DIAGNOSIS; ATROPHY;
DOI
10.1016/j.neuroimage.2013.10.005
Chinese Library Classification
Q189 [Neuroscience];
Discipline classification code
071006;
Abstract
Many neuroimaging applications deal with imbalanced imaging data. For example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study number nearly twice the Alzheimer's disease (AD) patients for the structural magnetic resonance imaging (MRI) modality and six times the control cases for the proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and combinations of over- and undersampling. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two classifiers, Random Forest and Support Vector Machines, based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids-based undersampling gives the best overall performance when compared with the other data sampling techniques and with no sampling; and (2) sparse logistic regression with stability selection achieves competitive performance among the feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results. © 2013 Elsevier Inc. All rights reserved.
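The K-Medoids-based undersampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `k_medoids` and `undersample_majority` are hypothetical, the clustering is a plain Voronoi-iteration K-Medoids written with NumPy, the number of medoids is assumed to equal the minority-class size (so that the result is balanced), and the 2:1 synthetic data merely stands in for an MCI-versus-AD style imbalance.

```python
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """Voronoi-iteration K-Medoids; returns row indices of k medoids in X.

    Assumes the rows of X are distinct (illustrative sketch only).
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all rows of X.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest current medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    return medoids

def undersample_majority(X, y, majority_label, seed=0):
    """Replace the majority class by as many medoids as there are minority samples."""
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = maj[k_medoids(X[maj], k=len(mino), seed=seed)]
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

# Synthetic stand-in data: 120 majority vs 60 minority samples, 5 features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(120, 5)),
               rng.normal(1.5, 1.0, size=(60, 5))])
y = np.array([0] * 120 + [1] * 60)

X_bal, y_bal = undersample_majority(X, y, majority_label=0)
```

Because every retained majority sample is an actual medoid rather than a synthetic centroid, the balanced set contains only observed data points. Repeating the call with different `seed` values is one possible way to obtain several undersampled training sets for an ensemble; whether this matches the ensemble construction used in the paper is an assumption.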
Pages: 220-241
Page count: 22
Related papers
50 in total
  • [1] Analysis of Sampling Techniques Towards Epileptic Seizure Detection from Imbalanced Dataset
    Masum, Mohammad
    Shahriar, Hossain
    Haddad, Hisham
    2020 IEEE 44TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2020), 2020, : 684 - 692
  • [2] The study of preprocessing methods' utility in analysis of multidimensional and highly imbalanced medical data
    Werner, Aleksandra
    Bach, Malgorzata
    Pluskiewicz, Wojciech
    PROCEEDINGS OF THE 11TH SCIENTIFIC CONFERENCE INTERNET IN THE INFORMATION SOCIETY 2016, 2016, : 71 - 87
  • [3] Model-Based Synthetic Sampling for Imbalanced Data
    Liu, Chien-Liang
    Hsieh, Po-Yen
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (08) : 1543 - 1556
  • [4] Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques
    Thanathamathee, Putthiporn
    Lursinsap, Chidchanok
    PATTERN RECOGNITION LETTERS, 2013, 34 (12) : 1339 - 1347
  • [5] Severely imbalanced Big Data challenges: investigating data sampling approaches
    Hasanin, Tawfiq
    Khoshgoftaar, Taghi M.
    Leevy, Joffrey L.
    Bauder, Richard A.
    JOURNAL OF BIG DATA, 2019, 6 (01)
  • [6] A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets
    Saul, Marcia Amstelvina
    Rostami, Shahin
    ADVANCES IN COMPUTATIONAL INTELLIGENCE SYSTEMS (UKCI), 2019, 840 : 240 - 251
  • [7] CSS: Handling imbalanced data by improved clustering with stratified sampling
    Cao, Lu
    Shen, Hong
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (02)
  • [8] An Active Under-sampling Approach for Imbalanced Data Classification
    Yang, Zeping
    Gao, Daqi
    2012 FIFTH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID 2012), VOL 2, 2012, : 270 - 273
  • [9] Imbalanced Data Classification Based on Feature Selection Techniques
    Ksieniewicz, Pawel
    Wozniak, Michal
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING (IDEAL 2018), PT II, 2018, 11315 : 296 - 303
  • [10] The study of under- and over-sampling methods' utility in analysis of highly imbalanced data on osteoporosis
    Bach, M.
    Werner, A.
    Zywiec, J.
    Pluskiewicz, W.
    INFORMATION SCIENCES, 2017, 384 : 174 - 190