FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning

Cited by: 0
Authors
Cai, Fenglong [1 ,2 ]
Yuan, Dong [3 ]
Yang, Zhe [1 ,2 ]
Xu, Yonghui [1 ,2 ]
He, Wei [1 ,2 ]
Guo, Wei [1 ,2 ]
Cui, Lizhen [1 ,2 ]
Affiliations
[1] Shandong Univ, Sch Software, 1500 Shunhua Rd, Jinan 250101, Shandong, Peoples R China
[2] Shandong Univ, Joint SDU-NTU Ctr Artificial Intelligence Res (C-FAIR), 1500 Shunhua Rd, Jinan 250101, Shandong, Peoples R China
[3] Univ Sydney, Sch Elect & Informat Engn, Sydney, NSW 2006, Australia
Keywords
Parallel inference; Pre-trained models; Service provisioning
DOI
10.1016/j.parco.2024.103114
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Pre-trained models (PTMs) have achieved great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory footprint and high computational requirements of PTMs increase the cost and latency of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose FastPTM, a general framework that accelerates PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework uses a fast weights-loading method, based on separating a PTM's weights from its model structure, to efficiently accelerate parallel inference services in resource-constrained environments. In addition, an online scheduling algorithm is designed to reduce the inference service time. Experimental results indicate that FastPTM improves the throughput of inference services by 4x on average and up to 8.2x, while reducing the number of model switches by 4.7x and the number of timeout requests by 15.3x.
Pages: 12
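
Note: the abstract describes separating a PTM's weights from its model structure so that switching between tenants' models on a GPU does not require rebuilding or reloading the entire model. The paper's actual implementation is not reproduced in this record; the following is only a minimal PyTorch sketch of that general idea, assuming a CUDA device and using illustrative names (build_skeleton, WeightCache, swap_in) that are not taken from the paper.

```python
import torch
import torch.nn as nn


def build_skeleton() -> nn.Module:
    # Shared model structure: built once and kept resident on the GPU.
    # A toy 2-layer MLP stands in for a real PTM backbone.
    return nn.Sequential(
        nn.Linear(768, 3072),
        nn.GELU(),
        nn.Linear(3072, 768),
    ).cuda()


class WeightCache:
    """Keep per-tenant weights in pinned host memory so that switching tenants
    only costs an asynchronous host-to-device weight copy, not a full reload."""

    def __init__(self, skeleton: nn.Module):
        self.skeleton = skeleton
        self.host_weights = {}  # tenant id -> state_dict held in pinned CPU memory

    def register(self, tenant: str, state_dict: dict) -> None:
        # Stage a tenant's fine-tuned weights in page-locked host memory.
        self.host_weights[tenant] = {
            name: t.detach().cpu().pin_memory() for name, t in state_dict.items()
        }

    def swap_in(self, tenant: str) -> nn.Module:
        # Overwrite the GPU parameters in place; the model structure is reused.
        with torch.no_grad():
            for name, param in self.skeleton.named_parameters():
                param.copy_(self.host_weights[tenant][name], non_blocking=True)
        torch.cuda.synchronize()  # make sure the copies finish before serving
        return self.skeleton


if __name__ == "__main__":
    skeleton = build_skeleton()
    cache = WeightCache(skeleton)
    # Stand-in for loading a tenant's fine-tuned checkpoint from storage.
    cache.register("tenant_a", skeleton.state_dict())
    model = cache.swap_in("tenant_a")
    print(model(torch.randn(1, 768, device="cuda")).shape)
```

In this sketch the per-request cost of serving a new tenant is a single weight copy into an already-built model, which is the kind of switching-overhead reduction the abstract attributes to FastPTM; how the paper actually stages weights and schedules requests is not specified here.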