TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism

Cited by: 24
Authors
Cai, Zhenkun [1 ]
Yan, Xiao [2 ]
Ma, Kaihao [1 ]
Wu, Yidi [1 ]
Huang, Yuzhen [1 ]
Cheng, James [1 ]
Su, Teng [3 ]
Yu, Fan [3 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen 518055, Guangdong, Peoples R China
[3] Huawei Technol Co Ltd, Shenzhen 518129, Guangdong, Peoples R China
Keywords
Training; Deep learning; Adaptation models; Memory management; Search problems; Encoding; Distributed systems; Large-scale model training
DOI
10.1109/TPDS.2021.3132413
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Effective parallelization strategies are crucial to the performance of distributed deep neural network (DNN) training. Several methods have recently been proposed to search for parallelization strategies, but they all optimize a single objective (e.g., execution time or memory consumption) and produce only one strategy. We propose Frontier Tracking (FT), an efficient algorithm that finds a set of Pareto-optimal parallelization strategies, allowing users to explore the best trade-offs among different objectives. FT can minimize memory consumption when the number of devices is limited and fully utilize additional resources to reduce execution time. Based on FT, we develop a user-friendly system, called TensorOpt, which allows users to run distributed DNN training jobs without having to deal with the details of searching for and implementing parallelization strategies. Experimental results show that TensorOpt adapts to resource availability more flexibly than existing frameworks.
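The abstract's central idea, returning a frontier of Pareto-optimal strategies rather than a single plan, can be illustrated with a minimal Python sketch. This is not the paper's FT algorithm (FT searches the strategy space far more efficiently than this brute-force filter); the Strategy record, the pareto_frontier helper, and the candidate numbers below are illustrative assumptions.

# A minimal sketch of Pareto-frontier filtering over (time, memory) costs.
# Not the paper's FT algorithm; all names and numbers here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str      # hypothetical label for a parallelization plan
    time: float    # estimated per-iteration execution time (seconds)
    memory: float  # estimated peak per-device memory (GB)

def pareto_frontier(candidates):
    """Keep the strategies that no other strategy beats on both objectives."""
    frontier = [
        s for s in candidates
        if not any(o.time <= s.time and o.memory <= s.memory and o != s
                   for o in candidates)
    ]
    # Sorting by memory lets a user walk the frontier, trading memory for speed.
    return sorted(frontier, key=lambda s: s.memory)

if __name__ == "__main__":
    candidates = [
        Strategy("data-parallel",  time=1.0, memory=12.0),
        Strategy("model-parallel", time=1.8, memory=5.0),
        Strategy("hybrid-A",       time=1.3, memory=7.0),
        Strategy("hybrid-B",       time=1.4, memory=9.0),  # dominated by hybrid-A
    ]
    for s in pareto_frontier(candidates):
        print(f"{s.name}: {s.time:.1f} s/iter, {s.memory:.1f} GB/device")

Along the returned frontier, a memory-constrained user would pick the low-memory end, while a user with spare devices would pick the low-time end; this mirrors how TensorOpt lets a training job adapt to resource availability.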
Pages: 1967-1981
Number of pages: 15