Optimizing Machine Learning on Apache Spark in HPC Environments

Cited by: 0
Authors
Li, Zhenyu [1]
Davis, James [1]
Jarvis, Stephen A. [1]
Affiliations
[1] Univ Warwick, Dept Comp Sci, Coventry, W Midlands, England
Source
PROCEEDINGS OF 2018 IEEE/ACM MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC 2018) | 2018
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Machine Learning; High Performance Computing; Apache Spark; All-Reduce; Asynchronous Stochastic Gradient Descent; MAPREDUCE;
DOI
10.1109/MLHPC.2018.00006
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine learning has established itself as a powerful tool for the construction of decision making models and algorithms through the use of statistical techniques on training data. However, a significant impediment to its progress is the time spent training and improving the accuracy of these models; this is a data- and compute-intensive process, which can often take days, weeks or even months to complete. A common approach to accelerating this process is to use multiple machines simultaneously, a trait shared with the field of High Performance Computing (HPC) and its clusters. However, existing distributed frameworks for data analytics and machine learning are designed for commodity servers; they do not realize the full potential of an HPC cluster, and thus deny the effective use of a readily available and potentially useful resource. In this work we adapt Apache Spark, a distributed data-flow framework, to support machine learning in HPC environments. There are inherent challenges to using Spark in this context: memory management, communication costs and synchronization overheads all reduce its efficiency. To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based all-reduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent (SGD) algorithm using non-blocking all-reduce. We demonstrate up to a 2.6x overall speedup (or an 11.2x theoretical speedup with an Nvidia K80 graphics card), an 82-91% compute ratio, and an 80% reduction in memory usage, when training the GoogLeNet model to classify 10% of the ImageNet dataset on a 32-node cluster. We also demonstrate a comparable convergence rate using the new asynchronous SGD with respect to the synchronous method.
With increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict a 2x further speedup (i.e. 22.4x accumulated speedup) is obtainable with the new asynchronous SGD algorithm on heterogeneous clusters.
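The abstract names two ideas, all-reduce and asynchronous SGD over a non-blocking all-reduce, without giving their details. The following is only a minimal single-process sketch of the general technique, not the authors' algorithm or their Spark implementation. It assumes a toy 1-D linear model, simulates workers as plain lists, and models the non-blocking all-reduce with a background thread whose result is applied one step late. All function names (`all_reduce_mean`, `local_gradient`, `train_sync`, `train_async`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def all_reduce_mean(worker_grads):
    """Element-wise mean of per-worker gradient vectors.

    In a real cluster every worker would end up holding this result;
    here it is simply computed in one place (an assumption of the sketch).
    """
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]


def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train_sync(shards, lr, steps):
    """Synchronous SGD: every step waits for the all-reduce to finish."""
    w = 0.0
    for _ in range(steps):
        grads = [[local_gradient(w, s)] for s in shards]
        w -= lr * all_reduce_mean(grads)[0]
    return w


def train_async(shards, lr, steps):
    """Delayed-gradient SGD: the all-reduce for step t runs in the
    background and its result is applied during step t + 1."""
    w = 0.0
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            grads = [[local_gradient(w, s)] for s in shards]
            future = pool.submit(all_reduce_mean, grads)  # does not block
            if pending is not None:
                # Apply the (one step stale) reduced gradient.
                w -= lr * pending.result()[0]
            pending = future
        # Drain the last outstanding reduction.
        w -= lr * pending.result()[0]
    return w


# Two simulated workers, each holding samples of y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w_sync = train_sync(shards, lr=0.02, steps=200)
w_async = train_async(shards, lr=0.02, steps=200)
```

On this toy problem both variants approach the true weight 3; the stale-gradient variant takes more steps to get there. This illustrates the general trade-off between overlapping communication with computation and using slightly outdated gradients, and says nothing about the convergence results reported in the paper.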
Pages: 95 - 105
Page count: 11