Analysis of sampling techniques for imbalanced data: An n=648 ADNI study

Cited by: 139

Authors:

Dubey, Rashmi [1,2]

Zhou, Jiayu [1,2]

Wang, Yalin [1]

Thompson, Paul M. [3]

Ye, Jieping [1,2]

Affiliations:

[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA

[2] Arizona State Univ, Biodesign Inst, Ctr Evolutionary Med & Informat, Tempe, AZ 85287 USA

[3] Univ Calif Los Angeles, Sch Med, Imaging Genet Ctr, Lab Neuro Imaging, Los Angeles, CA USA

Source:

NEUROIMAGE | 2014 / Volume 87

Funding:

US National Institutes of Health; Canadian Institutes of Health Research; US National Science Foundation;

Keywords:

Alzheimer's disease; Classification; Imbalanced data; Undersampling; Oversampling; Feature selection; ALZHEIMERS-DISEASE; CLASSIFICATION; MRI; HIPPOCAMPAL; ASSOCIATION; PREDICTION; BIOMARKERS; SIGNATURE; DIAGNOSIS; ATROPHY;

DOI:

10.1016/j.neuroimage.2013.10.005

Chinese Library Classification (CLC) number:

Q189 [Neuroscience];

Discipline classification code:

071006;

Abstract:

Many neuroimaging applications deal with imbalanced imaging data. For example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly twice as numerous as the Alzheimer's disease (AD) patients for the structural magnetic resonance imaging (MRI) modality and six times as numerous as the control cases for the proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and combinations of over- and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two classifiers, Random Forest and Support Vector Machines, based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids-based undersampling gives the best overall performance among the data sampling techniques considered and the no-sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results. © 2013 Elsevier Inc. All rights reserved.
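
The abstract names K-Medoids-based undersampling as the way a balanced training set is obtained, but gives no algorithmic detail. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the majority class is replaced by as many medoids as there are minority samples, uses a simple alternating assign-and-update K-Medoids with Euclidean distance, and runs on synthetic 2-D points. The function names (`k_medoids`, `undersample_majority`) and the toy labels are hypothetical.

```python
import math
import random


def k_medoids(points, k, n_iter=50, seed=0):
    """Return sorted indices of k medoids (simple alternating K-Medoids)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            if i in clusters:
                # A medoid always stays in its own cluster (no empty clusters).
                clusters[i].append(i)
            else:
                nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
                clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return sorted(medoids)


def undersample_majority(X, y, majority_label, seed=0):
    """Keep all minority samples; keep only medoids of the majority class.

    Assumption (not from the source): the number of medoids equals the
    number of minority samples, giving a 1:1 balanced training set.
    """
    major = [x for x, lab in zip(X, y) if lab == majority_label]
    minor = [(x, lab) for x, lab in zip(X, y) if lab != majority_label]
    keep = k_medoids(major, k=len(minor), seed=seed)
    X_bal = [major[i] for i in keep] + [x for x, _ in minor]
    y_bal = [majority_label] * len(keep) + [lab for _, lab in minor]
    return X_bal, y_bal


# Synthetic toy data (hypothetical): 12 majority vs. 4 minority points.
rng = random.Random(42)
X_major = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(12)]
X_minor = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(4)]
X = X_major + X_minor
y = ["MCI"] * 12 + ["AD"] * 4

X_bal, y_bal = undersample_majority(X, y, majority_label="MCI")
print(len(X_bal), y_bal.count("MCI"), y_bal.count("AD"))
```

Unlike centroids, medoids are actual observations, so every retained majority sample is a real data point rather than an averaged one. An ensemble over multiple undersampled datasets, as mentioned in the abstract, could be approximated by repeating the call with different `seed` values and training one classifier per balanced set.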


Pages: 220-241

Page count: 22
