Improving Random Forest and Rotation Forest for highly imbalanced datasets

Cited by: 43
Authors
Su, Chong [1 ,2 ]
Ju, Shenggen [1 ]
Liu, Yiguang [1 ]
Yu, Zhonghua [1 ]
Affiliations
[1] Sichuan Univ, Dept Comp, Chengdu 610065, Sichuan, Peoples R China
[2] Nanjing Jiangbei Peoples Hosp, Informat Ctr, Nanjing, Jiangsu, Peoples R China
Keywords
Random Forest; Rotation Forest; Hellinger distance; Hellinger distance decision tree (HDDT); highly imbalanced datasets; STATISTICAL COMPARISONS; CLASSIFICATION; CLASSIFIERS;
DOI
10.3233/IDA-150789
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The decision tree is a simple and effective method whose performance can be further improved with ensemble techniques. Random Forest and Rotation Forest are two approaches currently regarded as "classic". They build more accurate and diverse classifiers than Bagging and Boosting by introducing diversity, namely through randomly chosen feature subsets or rotated feature spaces. However, the splitting criteria used to construct each tree in Random Forest and Rotation Forest are the Gini index and the information gain ratio respectively, both of which are skew-sensitive. When learning from highly imbalanced datasets, class imbalance impedes their ability to learn the minority-class concept. The Hellinger distance decision tree (HDDT), proposed by Chawla, is skew-insensitive; in particular, bagged unpruned HDDTs have proven effective for highly imbalanced problems. Nevertheless, the bootstrap sampling used in Bagging can yield ensembles of lower diversity than Random Forest and Rotation Forest. To combine the skew-insensitivity of HDDT with the diversity of Random Forest and Rotation Forest, we use the Hellinger distance as the splitting criterion for building each tree in Random Forest and Rotation Forest. An experimental framework is applied across a wide range of highly imbalanced datasets to investigate the effectiveness of the Hellinger distance, information gain ratio, and Gini index as splitting criteria in ensembles of decision trees, including Bagging, Boosting, Random Forest, and Rotation Forest. In addition, Balanced Random Forest is included in the experiments, since it is designed to tackle the class imbalance problem.
The experimental results, contrasted through nonparametric statistical tests, demonstrate that using the Hellinger distance as the splitting criterion for the individual decision trees in the forest improves the performance of Random Forest and Rotation Forest on highly imbalanced classification tasks.
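To illustrate the skew-insensitive criterion the abstract describes, the following is a minimal sketch (not the authors' code) of how the Hellinger distance of a candidate binary split can be scored. The function name and signature are hypothetical; the formula follows the standard HDDT definition, comparing the square roots of the class-conditional branch proportions, so the score does not depend on the class prior:

```python
import numpy as np

def hellinger_split_value(y_left, y_right, minority=1):
    """Hellinger distance of a binary split (hypothetical helper).

    Skew-insensitive alternative to the Gini index / information gain
    ratio: for each branch, compare the fraction of all positives vs.
    the fraction of all negatives that the branch captures.
    HD = sqrt( sum_branches (sqrt(P(branch|+)) - sqrt(P(branch|-)))^2 )
    """
    y = np.concatenate([y_left, y_right])
    n_pos = np.sum(y == minority)
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0  # single-class node: no split to evaluate
    total = 0.0
    for branch in (y_left, y_right):
        p = np.sum(branch == minority) / n_pos  # P(branch | positive)
        q = np.sum(branch != minority) / n_neg  # P(branch | negative)
        total += (np.sqrt(p) - np.sqrt(q)) ** 2
    return np.sqrt(total)

# A perfectly separating split scores sqrt(2); a useless split scores 0,
# regardless of how rare the minority class is.
perfect = hellinger_split_value(np.array([1, 1]), np.array([0, 0, 0, 0]))
useless = hellinger_split_value(np.array([1, 0, 0]), np.array([1, 0, 0]))
```

In the paper's setting, a criterion like this would replace the Gini index (Random Forest) or information gain ratio (Rotation Forest) when ranking candidate splits at each node; the rest of the ensemble construction is unchanged.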
Pages: 1409-1432
Page count: 24