Classification of Imbalanced Data Using SMOTE and AutoEncoder Based Deep Convolutional Neural Network

Cited by: 8
Authors
Alex, Suja A. [1]
Nayahi, J. Jesu Vedha [2]
Affiliations
[1] St Xaviers Catholic Coll Engn, Informat Technol, Nagercoil, India
[2] Anna Univ Reg Campus, Comp Sci & Engn, Tirunelveli, India
Keywords
Unbalanced data; SMOTE; deep learning; AutoEncoder; convolutional neural network; FEATURE-SELECTION; DIABETES DISEASE; SAMPLING METHOD; K-MEANS; ALGORITHM; MACHINE; MODEL; LSTM; CLASSIFIERS
DOI
10.1142/S0218488523500228
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Imbalanced data classification is a challenging problem in many domains, including intelligent medical diagnosis and fraudulent transaction analysis. The performance of conventional classifiers degrades when the class distribution of the training data set is imbalanced. Recently, machine learning and deep learning techniques have been used for imbalanced data classification. Data preprocessing approaches are also suitable for handling the class imbalance problem, and data augmentation is one of the preprocessing techniques used to handle a skewed class distribution. The Synthetic Minority Oversampling Technique (SMOTE) is a promising class balancing approach, but it introduces noise while creating synthetic samples. In this paper, an AutoEncoder is used as a noise reduction technique to reduce the noise generated by SMOTE, and a deep one-dimensional Convolutional Neural Network is then used for classification. The performance of the proposed method is evaluated and compared with existing approaches using Precision, Recall, Accuracy, Area Under the Curve and Geometric Mean (G-mean). Ten data sets are used in the experiments, with imbalance ratios ranging from 1.17 to 577.87 and sizes ranging from 303 to 284807 instances. The data sets are Heart-Disease, Mammography, Pima Indian diabetes, Adult, Oil-Spill, Phoneme, Creditcard, BankNoteAuthentication, Balance scale weight & distance database and Yeast. On these data sets the proposed method achieves accuracies of 96.1%, 96.5%, 87.7%, 87.3%, 95%, 92.4%, 98.4%, 86.1%, 94% and 95.9%, respectively. The results suggest that the method outperforms other deep learning and machine learning methods with respect to G-mean and other performance metrics.
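As an illustration of the quantities named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the textbook forms of each quantity: SMOTE as linear interpolation between a minority sample and one of its k nearest minority neighbours, the imbalance ratio as majority count divided by minority count, and G-mean as the square root of sensitivity times specificity. The function names `smote_sample`, `imbalance_ratio` and `g_mean` are hypothetical, and the AutoEncoder denoising and 1D CNN stages are not shown.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Return one synthetic point on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two points (lists of floats)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


def imbalance_ratio(labels, minority_label=1):
    """Majority-class count divided by minority-class count (assumed definition)."""
    n_min = sum(1 for y in labels if y == minority_label)
    return (len(labels) - n_min) / n_min


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.
    Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy data, not from the paper
    print(smote_sample(minority_points, k=2, rng=random.Random(42)))
    print(imbalance_ratio([0, 0, 0, 1]))
    print(g_mean([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because a synthetic point is placed between two existing minority points, it can fall in a region dominated by the majority class. This is one common way to read the "noise" that the abstract attributes to SMOTE. G-mean is useful on skewed data because it falls to zero when either class is entirely misclassified, unlike plain accuracy.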
Pages: 437-469
Number of pages: 33