Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform

Cited by: 26

Authors:

Herrera, Victor M. [1]

Khoshgoftaar, Taghi M. [1]

Villanustre, Flavio [2]

Furht, Borko [1]

Affiliations:

[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, Boca Raton, FL 33431 USA

[2] LexisNexis Risk Solut, Alpharetta, GA 30005 USA

Source:

JOURNAL OF BIG DATA | 2019 / Volume 6 / Issue 1

Funding:

National Science Foundation (USA);

Keywords:

Random forest; LexisNexis's high performance computing cluster (HPCC) systems platform; Optimization for Big Data; Distributed machine learning; Turning recursion into iteration; CLASSIFICATION;

DOI:

10.1186/s40537-019-0232-1

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Subject classification code:

081202;

Abstract:

In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest's learning process is based on the principle of recursive partitioning and although recursion per se is not allowed in ECL (HPCC's programming language), we were able to implement the recursive partition algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and we thoroughly describe all the modifications required to overcome the bottleneck within the iterative split/partition process, i.e., the optimization of the data gathering of selected independent variables which are used for the node's best-split analysis. Essentially, we describe how our initial Random Forest implementation has been optimized and has become an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform's Big Data processing and analytics capabilities, we succeed in enhancing the data gathering method from an inefficient Pass them All and Filter approach into an effective and completely parallelized Fetching on Demand approach. Finally, based upon the results of our learning process runtime comparison between these two approaches, we confirm the speed up of our optimized Random Forest implementation.
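The abstract's central idea, replacing recursive partitioning with an iterative split/partition process, can be illustrated with a minimal Python sketch. This is not the authors' ECL code and does not reproduce their data-gathering optimization; it only shows, for a single decision tree, how an explicit work queue can stand in for recursive calls. The names `gini`, `best_split`, `build_tree`, and `predict`, the heap-style node numbering, and the `max_depth` parameter are all assumptions made for illustration. Bootstrap sampling and random feature selection, which a full Random Forest needs, are omitted.

```python
from collections import Counter, deque


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, idx):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini([labels[i] for i in idx])
    for f in range(len(rows[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so both sides of the split are guaranteed to be non-empty.
        for t in sorted({rows[i][f] for i in idx})[:-1]:
            left = [labels[i] for i in idx if rows[i][f] <= t]
            right = [labels[i] for i in idx if rows[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(idx)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth=5):
    """Grow a tree with an explicit work queue instead of recursive calls."""
    tree = {}  # node_id -> ("leaf", label) or ("split", feature, threshold)
    # Hypothetical numbering: root is 1, children of node k are 2k and 2k + 1.
    queue = deque([(1, list(range(len(rows))), 0)])
    while queue:
        node_id, idx, depth = queue.popleft()
        split = best_split(rows, labels, idx) if depth < max_depth else None
        if split is None:
            majority = Counter(labels[i] for i in idx).most_common(1)[0][0]
            tree[node_id] = ("leaf", majority)
            continue
        f, t = split
        tree[node_id] = ("split", f, t)
        queue.append((2 * node_id, [i for i in idx if rows[i][f] <= t], depth + 1))
        queue.append((2 * node_id + 1, [i for i in idx if rows[i][f] > t], depth + 1))
    return tree


def predict(tree, row):
    """Walk from the root to a leaf, again without recursion."""
    node_id = 1
    while tree[node_id][0] == "split":
        _, f, t = tree[node_id]
        node_id = 2 * node_id if row[f] <= t else 2 * node_id + 1
    return tree[node_id][1]
```

Because the queue is first-in, first-out, nodes are processed one tree level at a time, so each pass of the loop only needs the rows assigned to the node at hand. How the per-node data is gathered in a distributed setting is the bottleneck the abstract refers to, and it is outside the scope of this sketch.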

Pages: 36
