HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

Cited by: 0
Authors
Ding, Yifan [1]
Botzer, Nicholas [1]
Weninger, Tim [1]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
Source
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2021 / Vol. 35
Keywords
GO;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Modern deep learning systems like PyTorch and TensorFlow are able to train enormous models with billions (or trillions) of parameters on a distributed infrastructure. These systems require that the internal nodes have the same memory capacity and compute performance. Unfortunately, most organizations, especially universities, take a piecemeal approach to purchasing computer systems, resulting in a heterogeneous infrastructure that cannot be used to compute large models. The present work describes HetSeq, a software package adapted from the popular PyTorch package that provides the capability to train large neural network models on heterogeneous infrastructure. Experiments with language translation, text classification, and image classification show that HetSeq scales over heterogeneous systems. Additional information, support documents, and source code are publicly available at https://github.com/yifding/hetseq.
Pages: 15432-15438
Page count: 7