Runtime Data Layout Scheduling for Machine Learning Dataset

Cited by: 5

Authors:

You, Yang [1]

Demmel, James [1]

Affiliations:

[1] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA

Source:

2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP) | 2017

Keywords:

parallel auto-tuning; machine learning;

DOI:

10.1109/ICPP.2017.54

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Machine Learning (ML) approaches are widely used classification and regression methods for data mining applications. However, the time-consuming training process greatly limits their efficiency. In this paper, we use SVM (a traditional ML algorithm) and DNN (a state-of-the-art ML algorithm) as examples to illustrate the idea. For SVM, a major performance bottleneck of current tools is that they use a single, unified data storage format, even though the data format can significantly influence storage and computation complexity, memory bandwidth, and the efficiency of parallel processing. To address this problem, we study the factors that influence the algorithm's performance and apply auto-tuning to speed up SVM training. DNN training is even slower than SVM training. For example, training the AlexNet model on the CIFAR-10 dataset with an 8-core CPU takes 8.2 hours. CIFAR-10 is only 170 MB, which is too small for efficient distributed processing. Moreover, due to a limitation of the algorithm, only a small batch of data can be processed at each iteration. We focus on finding the right algorithmic parameters and using auto-tuning techniques to make the algorithm run faster. For SVM training, our implementation achieves a 1.7-16.3x speedup (6.8x on average) over the non-adaptive case (using the worst data format) on various datasets. For DNN training on the CIFAR-10 dataset, we reduce the time from 8.2 hours to roughly 1 minute. We use a dollars-per-speedup benchmark to help users select the right deep learning hardware.
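To make the data-format point concrete, the following is a minimal sketch of choosing a storage format at runtime. It is not the authors' auto-tuner: the density threshold, the restriction to two formats (dense and CSR), and all function names are assumptions made for illustration only. It shows how the same dot product, the core operation in SVM kernel evaluation, is computed differently depending on the chosen layout.

```python
from typing import List, Tuple

Dense = List[List[float]]
# CSR layout: (nonzero values, their column indices, row start pointers)
CSR = Tuple[List[float], List[int], List[int]]


def to_csr(rows: Dense) -> CSR:
    """Convert a dense row-major matrix to compressed sparse row form."""
    values: List[float] = []
    cols: List[int] = []
    ptr: List[int] = [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                cols.append(j)
        ptr.append(len(values))
    return values, cols, ptr


def density(rows: Dense) -> float:
    """Fraction of entries that are nonzero."""
    total = sum(len(r) for r in rows)
    nnz = sum(1 for r in rows for v in r if v != 0.0)
    return nnz / total if total else 0.0


def choose_format(rows: Dense, threshold: float = 0.5) -> str:
    """Hypothetical heuristic: dense if at least `threshold` of entries are nonzero."""
    return "dense" if density(rows) >= threshold else "csr"


def dot_dense(rows: Dense, i: int, j: int) -> float:
    """Dot product of rows i and j; touches every entry."""
    return sum(a * b for a, b in zip(rows[i], rows[j]))


def dot_csr(csr: CSR, i: int, j: int) -> float:
    """Dot product of rows i and j; touches only stored nonzeros."""
    values, cols, ptr = csr
    a, end_a = ptr[i], ptr[i + 1]
    b, end_b = ptr[j], ptr[j + 1]
    total = 0.0
    while a < end_a and b < end_b:
        if cols[a] == cols[b]:
            total += values[a] * values[b]
            a += 1
            b += 1
        elif cols[a] < cols[b]:
            a += 1
        else:
            b += 1
    return total


sparse_data = [[0.0, 0.0, 3.0, 0.0], [1.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
dense_data = [[1.0, 2.0], [3.0, 4.0]]
```

A fixed threshold is only a stand-in here. An auto-tuning approach such as the one the abstract describes would instead choose the format from measured or modeled performance on the actual dataset and hardware.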

Citation

Pages: 452 - 461

Page count: 10

50 items in total

[1] Advanced Machine Learning for Runtime Data Generation
Zamir, Bukhtawar
Campos, Joao R.
Vieira, Marco
PROCEEDINGS OF 12TH LATIN-AMERICAN SYMPOSIUM ON DEPENDABLE AND SECURE COMPUTING, LADC 2023, 2023, : 182 - 187
[2] Adaptive OpenMP Task Scheduling Using Runtime APIs and Machine Learning
Qawasmeh, Ahmad R.
Malik, Abid M.
Chapman, Barbara M.
2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2015, : 889 - 895
[3] A machine learning dataset for FRB detection in raw data
Xu, ZhiJun
An, Tao
Guo, ShaoGuang
Lao, BaoQiang
Lv, WeiJia
Wu, XiaoCong
SCIENTIA SINICA-PHYSICA MECHANICA & ASTRONOMICA, 2023, 53 (02)
[4] Reintroducing KAPD as a Dataset for Machine Learning and Data Mining Applications
Seddiq, Yasser
Meftah, Ali
Alghamdi, Mansour
Alotaibi, Yousef
UKSIM-AMSS 10TH EUROPEAN MODELLING SYMPOSIUM ON COMPUTER MODELLING AND SIMULATION (EMS), 2016, : 70 - 74
[5] FOWD: A Free Ocean Wave Dataset for Data Mining and Machine Learning
Hafner, Dion
Gemmrich, Johannes
Jochum, Markus
JOURNAL OF ATMOSPHERIC AND OCEANIC TECHNOLOGY, 2021, 38 (07) : 1305 - 1322
[6] Exploration of Machine Learning and Data Mining techniques on a horse racing dataset
Kyriacou, E
Toolan, F
Dunnion, J
MLMTA '05: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MACHINE LEARNING MODELS TECHNOLOGIES AND APPLICATIONS, 2005, : 161 - 166
[7] A survey on dataset quality in machine learning
Gong, Youdi
Liu, Guangzhen
Xue, Yunzhi
Li, Rui
Meng, Lingzhong
INFORMATION AND SOFTWARE TECHNOLOGY, 2023, 162
[8] RMLIM: A Runtime Machine Learning Based Identification Model for Approximate Computing on Data Flow Graphs
Wang, Ye
Dong, Jian
Liu, Yanxin
Wang, Chunpei
Qu, Gang
IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, 2022, 7 (01) : 201 - 210
[9] Citizens' data afterlives: Practices of dataset inclusion in machine learning for public welfare
Ratner, Helene Friis
Thylstrup, Nanna Bonde
AI & SOCIETY, 2024, 40 (3) : 1183 - 1193
[10] Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models
Ahmed, Nasim
Barczak, Andre L. C.
Rashid, Mohammad A.
Susnjak, Teo
JOURNAL OF BIG DATA, 2022, 9 (01)
