Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis

Cited: 0
Authors
Yu, Fuxun [1]
Xu, Zirui [2]
Shangguan, Longfei [3]
Wang, Di [1]
Stamoulis, Dimitrios [1]
Madhok, Rishi [1]
Karianakis, Nikolaos [1]
Li, Ang [4]
Liu, Chenchen [5]
Chen, Yiran [6]
Chen, Xiang [7,8]
Affiliations
[1] Microsoft Corp, Dept Res & Dev, Redmond, WA 98052 USA
[2] CVS Hlth Corp, Dept Res & Dev, Woonsocket, RI 02895 USA
[3] Univ Pittsburgh, Dept Comp Sci, Pittsburgh, PA 15261 USA
[4] Univ Maryland Coll Pk, Dept Elect & Comp Engn, College Pk, MD 20742 USA
[5] Univ Maryland Baltimore Cty, Dept Comp Sci & Elect Engn, Baltimore, MD 21250 USA
[6] Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA
[7] George Mason Univ, Dept Elect & Comp Engn, Fairfax, VA 22030 USA
[8] Peking Univ, Sch Comp Sci, Beijing 100871, Peoples R China
Keywords
Graphics processing units; Optimization; Tail; Runtime; Computational modeling; Artificial neural networks; Hardware; AI accelerators; artificial neural networks; hardware acceleration
DOI
10.1109/TCAD.2024.3404413
CLC Number
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
As the size of deep neural networks (DNNs) continues to grow, their runtime latency scales with it. While model pruning and neural architecture search (NAS) can effectively reduce the computation workload, that reduction fails to consistently translate into lower runtime latency. In this article, we identify the root cause behind the mismatch between workload reduction and latency reduction as the graphics processing unit (GPU) tail effect, a classic system issue caused by resource underutilization in the last processing wave of the GPU. We conduct detailed DNN workload characterization to demonstrate the prevalence of the GPU tail effect across different DNN architectures, and reveal that the deep structure and lightweight per-layer workload unique to DNNs exacerbate the tail effect during inference. We then propose a tail-aware design space enhancement and DNN optimization algorithm that improve existing NAS and pruning designs, achieving better runtime latency and model accuracy. Extensive experiments show an 11%-27% latency reduction over state-of-the-art (SOTA) DNN pruning and NAS methods.
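The workload-versus-latency mismatch described above can be made concrete with a back-of-the-envelope wave-quantization model: a kernel's thread blocks execute in "waves" of roughly (SM count x resident blocks per SM), and a partially filled last wave, the tail, occupies the GPU for as long as a full one. The Python sketch below is a minimal illustration of that arithmetic under assumed round numbers (80 SMs, 2 resident blocks per SM, a flat 50 us per wave); the figures and the estimated_latency helper are hypothetical, not the paper's characterization or algorithm.

import math

# Hypothetical round-number GPU figures for illustration only; real values
# depend on the GPU model, kernel occupancy, and thread-block size.
NUM_SMS = 80                              # streaming multiprocessors
BLOCKS_PER_SM = 2                         # resident thread blocks per SM
WAVE_CAPACITY = NUM_SMS * BLOCKS_PER_SM   # blocks retired per "wave" (160)

def estimated_latency(num_blocks: int, wave_time_us: float = 50.0) -> float:
    """Latency under wave quantization: every wave costs the same,
    even when the last (tail) wave is only partially occupied."""
    num_waves = math.ceil(num_blocks / WAVE_CAPACITY)
    return num_waves * wave_time_us

# A layer launching 200 blocks needs ceil(200/160) = 2 waves; pruning it
# to 170 blocks (a 15% workload cut) still needs 2 waves, so the modeled
# latency does not move at all.
print(estimated_latency(200))   # 100.0 (us)
print(estimated_latency(170))   # 100.0 (us)
# Only crossing the wave boundary (<= 160 blocks) recovers a full wave.
print(estimated_latency(160))   # 50.0 (us)

Under these assumptions, only pruning decisions that cross a wave boundary translate into latency savings, which is the kind of boundary a tail-aware design space would target.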
Pages: 266-279
Number of Pages: 14
Related Papers
18 records in total
  • [1] Latency-aware automatic CNN channel pruning with GPU runtime analysis
    Liu J.
    Sun J.
    Xu Z.
    Sun G.
    BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2021, 1 (01)
  • [2] Latency-Aware Packet Processing on CPU-GPU Heterogeneous Systems
    Maghazeh, Arian
    Bordoloi, Unmesh D.
    Dastgeer, Usman
    Andrei, Alexandru
    Eles, Petru
    Peng, Zebo
    PROCEEDINGS OF THE 2017 54TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2017
  • [3] CNNBooster: Accelerating CNN Inference with Latency-aware Channel Pruning for GPU
    Zhu, Yuting
    Jiang, Hongxu
    Zhang, Runhua
    Zhang, Yonghua
    Dong, Dong
    2022 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING, ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM, 2022: 355-362
  • [4] Enabling Latency-Aware Data Initialization for Integrated CPU/GPU Heterogeneous Platform
    Wang, Zhendong
    Jiang, Zihang
    Wang, Zhen
    Tang, Xulong
    Liu, Cong
    Yin, Shouyi
    Hu, Yang
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2020, 39 (11): 3433-3444
  • [5] Joint Fault Tolerant and Latency-Aware Design of Multilayer Optical Networks
    Pedreno-Manresa, Jose-Juan
    Izquierdo-Zaragoza, Jose-Luis
    Pavon-Marino, Pablo
    20TH INTERNATIONAL CONFERENCE ON OPTICAL NETWORK DESIGN AND MODELING (ONDM 2016), 2016
  • [6] Hierarchical Matching with Peer Effect for Latency-Aware Caching in Social IoT
    Wang, Bowen
    Sun, Yanjing
    Li, Song
    Cao, Qi
    Chen, Yan
    Xu, Jie
    2018 IEEE INTERNATIONAL CONFERENCE ON SMART INTERNET OF THINGS (SMARTIOT 2018), 2018: 255-262
  • [7] The Design and Implementation of a Latency-Aware Packet Classification for OpenFlow Protocol based on FPGA
    Chiu, Yu-Kai
    Ruan, Shanq-Jang
    Shen, Chung-An
    Hung, Chun-Chi
    PROCEEDINGS OF 2018 VII INTERNATIONAL CONFERENCE ON NETWORK, COMMUNICATION AND COMPUTING (ICNCC 2018), 2018: 64-69
  • [8] Performance Analysis of Latency-Aware Data Management in Industrial IoT Networks
    Raptis, Theofanis P.
    Passarella, Andrea
    Conti, Marco
    SENSORS, 2018, 18 (08)
  • [9] Design of Latency-Aware IoT Modules in Heterogeneous Fog-Cloud Computing Networks
    Hassan, Syed Rizwan
    Ahmad, Ishtiaq
    Nebhen, Jamel
    Rehman, Ateeq Ur
    Shafiq, Muhammad
    Choi, Jin-Ghoo
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (03): 6057-6072
  • [10] RobinHood: Tail Latency-Aware Caching - Dynamically Reallocating from Cache-Rich to Cache-Poor
    Berger, Daniel S.
    Berg, Benjamin
    Zhu, Timothy
    Harchol-Balter, Mor
    Sen, Siddhartha
    PROCEEDINGS OF THE 13TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, 2018: 195-212