Improving Random Forest and Rotation Forest for highly imbalanced datasets

Cited by: 38
Authors
Su, Chong [1,2]
Ju, Shenggen [1]
Liu, Yiguang [1]
Yu, Zhonghua [1]
Affiliations
[1] Sichuan Univ, Dept Comp, Chengdu 610065, Sichuan, Peoples R China
[2] Nanjing Jiangbei Peoples Hosp, Informat Ctr, Nanjing, Jiangsu, Peoples R China
Keywords
Random Forest; Rotation Forest; Hellinger distance; Hellinger distance decision tree (HDDT); highly imbalanced datasets; STATISTICAL COMPARISONS; CLASSIFICATION; CLASSIFIERS;
DOI
10.3233/IDA-150789
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
The decision tree is a simple and effective method, and it can be combined with ensemble techniques to improve its performance. Random Forest and Rotation Forest are two approaches that are now regarded as classic. They can build more accurate and diverse classifiers than Bagging and Boosting by introducing diversity, namely by randomly choosing a subset of features or by rotating the feature space. However, the splitting criteria used for constructing each tree in Random Forest and Rotation Forest are the Gini index and the information gain ratio, respectively, both of which are skew-sensitive. When learning from highly imbalanced datasets, class imbalance impedes their ability to learn the minority-class concept. The Hellinger distance decision tree (HDDT), proposed by Chawla, is skew-insensitive. In particular, bagged unpruned HDDTs have proven to be an effective way to deal with highly imbalanced problems. Nevertheless, the bootstrap sampling used in Bagging can lead to ensembles of lower diversity than Random Forest and Rotation Forest. To combine the skew-insensitivity of HDDT with the diversity of Random Forest and Rotation Forest, we use the Hellinger distance as the splitting criterion for building each tree in both Random Forest and Rotation Forest. An experimental framework is run across a wide range of highly imbalanced datasets to investigate the effectiveness of the Hellinger distance, the information gain ratio, and the Gini index as splitting criteria in ensembles of decision trees, including Bagging, Boosting, Random Forest, and Rotation Forest. Balanced Random Forest is also included in the experiments, since it is designed to tackle the class imbalance problem. The experimental results, compared through nonparametric statistical tests, demonstrate that using the Hellinger distance as the splitting criterion to build the individual decision trees improves the performance of Random Forest and Rotation Forest for highly imbalanced classification.
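To make the splitting criterion concrete, below is a minimal Python sketch of the binary-class Hellinger distance split score used by HDDT (after Cieslak and Chawla). The function names and the threshold scan are illustrative assumptions, not code from the paper, which plugs the criterion into full Random Forest and Rotation Forest implementations.

```python
import numpy as np

def hellinger_split_score(y_left, y_right, pos_label=1):
    """Hellinger distance between the class-conditional distributions
    induced by a binary split (the HDDT criterion); larger is better.

    The score uses only within-class branch rates, never class priors,
    which is what makes it skew-insensitive.
    """
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    n_pos = np.sum(y_left == pos_label) + np.sum(y_right == pos_label)
    n_neg = y_left.size + y_right.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0  # degenerate split: one class is entirely absent
    score = 0.0
    for branch in (y_left, y_right):
        tp_rate = np.sum(branch == pos_label) / n_pos  # P(branch | positive)
        fp_rate = np.sum(branch != pos_label) / n_neg  # P(branch | negative)
        score += (np.sqrt(tp_rate) - np.sqrt(fp_rate)) ** 2
    return float(np.sqrt(score))

def best_hellinger_threshold(x, y):
    """Hypothetical helper: scan candidate thresholds on one numeric
    feature and return the threshold maximizing the Hellinger score."""
    x, y = np.asarray(x), np.asarray(y)
    best_t, best_s = None, -1.0
    for t in np.unique(x)[:-1]:  # exclude max so both branches are nonempty
        s = hellinger_split_score(y[x <= t], y[x > t])
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s

# Tiny imbalanced toy set: the clean split at 0.3 scores the maximum, sqrt(2).
x = np.array([0.1, 0.2, 0.3, 0.9, 1.0])
y = np.array([0, 0, 0, 1, 1])
print(best_hellinger_threshold(x, y))  # -> (0.3, 1.4142...)
```

Because the score depends only on how each class distributes across the two branches and not on the class priors, it is unchanged by the imbalance ratio; this is the skew-insensitivity that the abstract contrasts with the Gini index and the information gain ratio.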
Pages: 1409-1432
Number of pages: 24
Related papers
50 records in total
  • [1] Improving undersampling-based ensemble with rotation forest for imbalanced problem
    Guo, Huaping
    Diao, Xiaoyu
    Liu, Hongbing
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2019, 27 (02) : 1371 - 1386
  • [2] Improving Rotation Forest Performance for Imbalanced Data Classification through Fuzzy Clustering
    Hosseinzadeh, Mehrdad
    Eftekhari, Mahdi
    2015 INTERNATIONAL SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND SIGNAL PROCESSING (AISP), 2015, : 35 - 40
  • [3] Embedding Undersampling Rotation Forest for Imbalanced Problem
    Guo, Huaping
    Diao, Xiaoyu
    Liu, Hongbing
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2018, 2018
  • [4] Predicting disease risks from highly imbalanced data using random forest
    Khalilia, Mohammed
    Chakraborty, Sounak
    Popescu, Mihail
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2011, 11
  • [5] Balanced random forest for imbalanced data streams
    Yagci, A. Murat
    Aytekin, Tevfik
    Gurgen, Fikret S.
    2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1065 - 1068
  • [6] Handling imbalanced datasets through Optimum-Path Forest
    Passos, Leandro Aparecido S.
    Jodas, Danilo S.
    Ribeiro, Luiz C. F.
    Akio, Marco
    De Souza, Andre Nunes
    Papa, Joao Paulo
    KNOWLEDGE-BASED SYSTEMS, 2022, 242
  • [7] Rotation forest of random subspace models
    Alexandropoulos, Stamatios-Aggelos N.
    Aridas, Christos K.
    Kotsiantis, Sotiris B.
    Gravvanis, George A.
    Vrahatis, Michael N.
    INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS, 2022, 16 (02): 315 - 324
  • [8] Oblique and rotation double random forest
    Ganaie, M. A.
    Tanveer, M.
    Suganthan, P. N.
    Snasel, V.
    NEURAL NETWORKS, 2022, 153 : 496 - 517
  • [9] Crack Random Forest for Arbitrary Large Datasets
    Lulli, Alessandro
    Oneto, Luca
    Anguita, Davide
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 706 - 715