Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis

Cited: 0
Authors
Yu, Fuxun [1]
Xu, Zirui [2]
Shangguan, Longfei [3]
Wang, Di [1]
Stamoulis, Dimitrios [1]
Madhok, Rishi [1]
Karianakis, Nikolaos [1]
Li, Ang [4]
Liu, Chenchen [5]
Chen, Yiran [6]
Chen, Xiang [7,8]
Affiliations
[1] Microsoft Corp, Dept Res & Dev, Redmond, WA 98052 USA
[2] CVS Hlth Corp, Dept Res & Dev, Woonsocket, RI 02895 USA
[3] Univ Pittsburgh, Dept Comp Sci, Pittsburgh, PA 15261 USA
[4] Univ Maryland Coll Pk, Dept Elect & Comp Engn, College Pk, MD 20742 USA
[5] Univ Maryland Baltimore Cty, Dept Comp Sci & Elect Engn, Baltimore, MD 21250 USA
[6] Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA
[7] George Mason Univ, Dept Elect & Comp Engn, Fairfax, VA 22030 USA
[8] Peking Univ, Sch Comp Sci, Beijing 100871, Peoples R China
Keywords
Graphics processing units; Optimization; Tail; Runtime; Computational modeling; Artificial neural networks; Hardware; AI accelerators; artificial neural networks; hardware acceleration
DOI
10.1109/TCAD.2024.3404413
CLC Number
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
As the size of deep neural networks (DNNs) continues to grow, their runtime latency scales with it. While model pruning and neural architecture search (NAS) can effectively reduce the computation workload, that reduction fails to consistently translate into lower runtime latency. In this article, we identify the root cause behind the mismatch between workload reduction and latency reduction as the graphics processing unit (GPU) tail effect, a classic system issue caused by resource underutilization in the last processing wave of the GPU. We conduct detailed DNN workload characterization to demonstrate the prevalence of the GPU tail effect across different DNN architectures, and reveal that the deep structure and lightweight per-layer workload unique to DNNs exacerbate the tail effect during inference. We then propose a tail-aware design space enhancement and DNN optimization algorithm that improve existing NAS and pruning designs, achieving better runtime latency and model accuracy. Extensive experiments show an 11%-27% latency reduction over state-of-the-art (SOTA) DNN pruning and NAS methods.
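The workload-versus-latency mismatch described above can be made concrete with a back-of-the-envelope wave-quantization model: a kernel's thread blocks execute in "waves" of roughly (SM count x resident blocks per SM), and a partially filled last wave, the tail, occupies the GPU for as long as a full one. The Python sketch below is a minimal illustration of that arithmetic under assumed round numbers (80 SMs, 2 resident blocks per SM, a flat 50 us per wave); the figures and the estimated_latency helper are hypothetical, not the paper's characterization or algorithm.

import math

# Hypothetical round-number GPU figures for illustration only; real values
# depend on the GPU model, kernel occupancy, and thread-block size.
NUM_SMS = 80                              # streaming multiprocessors
BLOCKS_PER_SM = 2                         # resident thread blocks per SM
WAVE_CAPACITY = NUM_SMS * BLOCKS_PER_SM   # blocks retired per "wave" (160)

def estimated_latency(num_blocks: int, wave_time_us: float = 50.0) -> float:
    """Latency under wave quantization: every wave costs the same,
    even when the last (tail) wave is only partially occupied."""
    num_waves = math.ceil(num_blocks / WAVE_CAPACITY)
    return num_waves * wave_time_us

# A layer launching 200 blocks needs ceil(200/160) = 2 waves; pruning it
# to 170 blocks (a 15% workload cut) still needs 2 waves, so the modeled
# latency does not move at all.
print(estimated_latency(200))   # 100.0 (us)
print(estimated_latency(170))   # 100.0 (us)
# Only crossing the wave boundary (<= 160 blocks) recovers a full wave.
print(estimated_latency(160))   # 50.0 (us)

Under these assumptions, only pruning decisions that cross a wave boundary translate into latency savings, which is the kind of boundary a tail-aware design space would target.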
Pages: 266-279
Number of Pages: 14
Related Papers
18 records in total
  • [1] Latency-aware automatic CNN channel pruning with GPU runtime analysis
    Liu J.
    Sun J.
    Xu Z.
    Sun G.
    BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2021, 1 (01)
  • [2] Latency-Aware Packet Processing on CPU-GPU Heterogeneous Systems
    Maghazeh, Arian
    Bordoloi, Unmesh D.
    Dastgeer, Usman
    Andrei, Alexandru
    Eles, Petru
    Peng, Zebo
    PROCEEDINGS OF THE 2017 54TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2017
  • [3] CNNBooster: Accelerating CNN Inference with Latency-aware Channel Pruning for GPU
    Zhu, Yuting
    Jiang, Hongxu
    Zhang, Runhua
    Zhang, Yonghua
    Dong, Dong
    2022 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING, ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM, 2022: 355-362
  • [4] Enabling Latency-Aware Data Initialization for Integrated CPU/GPU Heterogeneous Platform
    Wang, Zhendong
    Jiang, Zihang
    Wang, Zhen
    Tang, Xulong
    Liu, Cong
    Yin, Shouyi
    Hu, Yang
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2020, 39 (11): 3433-3444
  • [5] Joint Fault Tolerant and Latency-Aware Design of Multilayer Optical Networks
    Pedreno-Manresa, Jose-Juan
    Izquierdo-Zaragoza, Jose-Luis
    Pavon-Marino, Pablo
    20TH INTERNATIONAL CONFERENCE ON OPTICAL NETWORK DESIGN AND MODELING (ONDM 2016), 2016
  • [6] Hierarchical Matching with Peer Effect for Latency-Aware Caching in Social IoT
    Wang, Bowen
    Sun, Yanjing
    Li, Song
    Cao, Qi
    Chen, Yan
    Xu, Jie
    2018 IEEE INTERNATIONAL CONFERENCE ON SMART INTERNET OF THINGS (SMARTIOT 2018), 2018: 255-262
  • [7] The Design and Implementation of a Latency-Aware Packet Classification for OpenFlow Protocol based on FPGA
    Chiu, Yu-Kai
    Ruan, Shanq-Jang
    Shen, Chung-An
    Hung, Chun-Chi
    PROCEEDINGS OF 2018 VII INTERNATIONAL CONFERENCE ON NETWORK, COMMUNICATION AND COMPUTING (ICNCC 2018), 2018: 64-69
  • [8] Performance Analysis of Latency-Aware Data Management in Industrial IoT Networks
    Raptis, Theofanis P.
    Passarella, Andrea
    Conti, Marco
    SENSORS, 2018, 18 (08)
  • [9] Design of Latency-Aware IoT Modules in Heterogeneous Fog-Cloud Computing Networks
    Hassan, Syed Rizwan
    Ahmad, Ishtiaq
    Nebhen, Jamel
    Rehman, Ateeq Ur
    Shafiq, Muhammad
    Choi, Jin-Ghoo
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (03): 6057-6072
  • [10] RobinHood: Tail Latency-Aware Caching - Dynamically Reallocating from Cache-Rich to Cache-Poor
    Berger, Daniel S.
    Berg, Benjamin
    Zhu, Timothy
    Harchol-Balter, Mor
    Sen, Siddhartha
    PROCEEDINGS OF THE 13TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, 2018: 195-212