Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform

Cited by: 26
Authors
Herrera, Victor M. [1 ]
Khoshgoftaar, Taghi M. [1 ]
Villanustre, Flavio [2 ]
Furht, Borko [1 ]
Affiliations
[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, Boca Raton, FL 33431 USA
[2] LexisNexis Risk Solut, Alpharetta, GA 30005 USA
Funding
National Science Foundation (USA);
Keywords
Random forest; LexisNexis's high performance computing cluster (HPCC) systems platform; Optimization for Big Data; Distributed machine learning; Turning recursion into iteration; CLASSIFICATION;
DOI
10.1186/s40537-019-0232-1
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In this paper, we explain in detail how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest's learning process is based on the principle of recursive partitioning, and although recursion per se is not allowed in ECL (HPCC's programming language), we were able to implement the recursive partition algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and describe all the modifications required to overcome the bottleneck within the iterative split/partition process, i.e., the optimization of the data gathering of the selected independent variables that are used for each node's best-split analysis. Essentially, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform's Big Data processing and analytics capabilities, we succeeded in enhancing the data gathering method from an inefficient "Pass them All and Filter" approach into an effective and completely parallelized "Fetching on Demand" approach. Finally, based on the results of our learning process runtime comparison between these two approaches, we confirm the speedup of our optimized Random Forest implementation.
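The idea of turning recursive partitioning into an iterative split/partition process can be illustrated with a minimal sketch. This is not the authors' ECL code: it is plain single-machine Python for one decision tree, with no bootstrap sampling, no random feature subsets, and no distributed data gathering. The names `gini`, `best_split`, `build_tree_iteratively`, and `predict`, the Gini impurity criterion, and the `max_depth` parameter are all assumptions made for illustration.

```python
from collections import Counter, deque


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows):
    """Return (feature, threshold, left_rows, right_rows), or None if no
    split strictly lowers the weighted impurity of this node."""
    best = None
    best_score = gini([y[i] for i in rows])
    for f in range(len(X[0])):
        for t in sorted({X[i][f] for i in rows}):
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(rows)
            if score < best_score:
                best_score, best = score, (f, t, left, right)
    return best


def build_tree_iteratively(X, y, max_depth=5):
    """Grow a tree with an explicit work queue instead of recursion.
    Each loop pass splits one pending node and enqueues its children."""
    tree = {}
    pending = deque([(0, list(range(len(X))), 0)])  # (node id, rows, depth)
    next_id = 1
    while pending:
        node_id, rows, depth = pending.popleft()
        split = best_split(X, y, rows) if depth < max_depth else None
        if split is None:
            # No useful split (or depth limit reached): emit a majority leaf.
            majority = Counter(y[i] for i in rows).most_common(1)[0][0]
            tree[node_id] = {"leaf": majority}
            continue
        f, t, left, right = split
        tree[node_id] = {"feature": f, "threshold": t,
                         "left": next_id, "right": next_id + 1}
        pending.append((next_id, left, depth + 1))
        pending.append((next_id + 1, right, depth + 1))
        next_id += 2
    return tree


def predict(tree, x):
    """Walk from the root (node 0) down to a leaf."""
    node = tree[0]
    while "leaf" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = tree[node["left"] if go_left else node["right"]]
    return node["leaf"]
```

The queue here plays the role that the call stack plays in a recursive formulation: every pending node carries the rows it owns, so the loop can process nodes in any order without calling itself.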
Pages: 36
Related papers
14 in total
  • [1] Random forest implementation and optimization for Big Data analytics on LexisNexis’s high performance computing cluster platform
    Victor M. Herrera
    Taghi M. Khoshgoftaar
    Flavio Villanustre
    Borko Furht
    Journal of Big Data, 6
  • [2] Hyperparameters Optimization in Scalable Random Forest For Big Data Analytics
    Oo, Myal Cho Mon
    Thein, Thandar
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION SYSTEMS (ICCCS 2019), 2019, : 125 - 129
  • [3] Implementation and performance optimization of dynamic random forest
    Xu, Xiaolong
    Chen, Wen
    2017 INTERNATIONAL CONFERENCE ON CYBER-ENABLED DISTRIBUTED COMPUTING AND KNOWLEDGE DISCOVERY (CYBERC), 2017, : 283 - 289
  • [4] A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment
    Chen, Jianguo
    Li, Kenli
    Tang, Zhuo
    Bilal, Kashif
    Yu, Shui
    Weng, Chuliang
    Li, Keqin
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (04) : 919 - 933
  • [5] A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
    Cliff, Ashley
    Romero, Jonathon
    Kainer, David
    Walker, Angelica
    Furches, Anna
    Jacobson, Daniel
    GENES, 2019, 10 (12)
  • [6] Modifying Cleaning Method in Big Data Analytics Process using Random Forest Classifier
    Hossen, J.
    Jesmeen, M. Z. H.
    Sayeed, Shohel
    PROCEEDINGS OF THE 2018 7TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING (ICCCE), 2018, : 208 - 213
  • [7] Preliminary Big Data Analytics of Hepatitis Disease by Random Forest and SVM Using R-Tool
    Lakshmi, Visali P. R.
    Shwetha, G.
    Raja, N. Madhava
    2017 THIRD INTERNATIONAL CONFERENCE ON BIOSIGNALS, IMAGES AND INSTRUMENTATION (ICBSII), 2017,
  • [8] BayesRandomForest: An R implementation of Bayesian Random Forest for Regression Analysis of High-dimensional Data
    Olaniran, Oyebayo Ridwan
    Bin Abdullah, Mohd Asrul Affendi
    ROMANIAN STATISTICAL REVIEW, 2018, (01) : 95 - 102
  • [9] Mapping potential wetlands by a new framework method using random forest algorithm and big Earth data: A case study in China's Yangtze River Basin
    Xiang, Hengxing
    Xi, Yanbiao
    Mao, Dehua
    Mahdianpari, Masoud
    Zhang, Jian
    Wang, Ming
    Jia, Mingming
    Yu, Fudong
    Wang, Zongming
    GLOBAL ECOLOGY AND CONSERVATION, 2023, 42
  • [10] Scalable analysis of Big pathology image data cohorts using efficient methods and high-performance computing strategies
    Kurc, Tahsin
    Qi, Xin
    Wang, Daihou
    Wang, Fusheng
    Teodoro, George
    Cooper, Lee
    Nalisnik, Michael
    Yang, Lin
    Saltz, Joel
    Foran, David J.
    BMC BIOINFORMATICS, 2015, 16