Ensemble learning based predictive modelling on a highly imbalanced multiclass data

Cited by: 0
Authors
Vasti, Manka [1, 2]
Dev, Amita [3]
Affiliations
[1] Guru Gobind Singh Indraprastha Univ, Univ Sch Informat Commun & Technol, New Delhi 110078, India
[2] GD Goenka Univ, Sch Engn & Sci, Dept Comp Sci & Engn, Gurugram 122103, Haryana, India
[3] Directorate Training & Tech Educ, Delhi 110034, India
Keywords
Ensemble learning; Data augmentation; Earthquake prediction; Cluster based undersam-; CLASSIFICATION; PERFORMANCE;
DOI
10.47974/JIOS-1778
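Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work];
Subject classification code
1205; 120501;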
Abstract
Class imbalance in real-world datasets is a major challenge, and domains such as fraud detection, calamity occurrence, and bankruptcy prediction are prone to it because of the way the underlying events occur. In this paper, six ensemble machine learning techniques are applied to the undersampled, oversampled, and original datasets, and the results are compared. The results indicate that, among the six ensemble learners, the best is the Random Forest algorithm (with entropy gain) implemented using ten-fold cross validation on the SMOTE-oversampled dataset. 0.95 AUC and 0.8689 accuracy, i.e. an increase of 4% in accuracy and substantial
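The SMOTE oversampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a simplified, standard-library-only version of SMOTE-style interpolation, and the function name `smote_oversample`, the parameters `k` and `seed`, and the toy data are assumptions made for illustration. The Random Forest and ten-fold cross validation steps are not shown.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Raise every class to the majority-class count by interpolating between
    a sample and one of its k nearest same-class neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)

    for label, count in counts.items():
        members = [row for row, lab in zip(X, y) if lab == label]
        if count == target or len(members) < 2:
            continue  # majority class, or too few points to interpolate
        for _ in range(target - count):
            base = rng.choice(members)
            # k nearest same-class neighbours of the base point, excluding itself
            others = [m for m in members if m is not base]
            neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical three-class toy data: class 0 is the majority.
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5],
         [5.0, 5.0], [6.0, 6.0],
         [9.0, 1.0], [9.0, 2.0], [10.0, 1.0]]
y_toy = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

X_bal, y_bal = smote_oversample(X_toy, y_toy, k=2, seed=1)
print(Counter(y_bal))  # every class now has as many rows as the majority class
```

Each synthetic row lies on the line segment between two existing rows of the same class, so minority classes are filled in rather than duplicated. A production pipeline would more likely rely on an established library implementation than on this sketch.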
Pages: 2141-2164
Page count: 24