Efficient Distributed Parallel Inference Strategies via Block-based DNN Structure in Edge-to-IoT Continuum

Cited by: 0
Authors
Choi, Inhun [1 ]
Akhter, Sharmen [1 ]
Jeong, Hong-Ju [2 ]
Huh, Eui-Nam [1 ]
Affiliations
[1] Kyung Hee Univ, Dept Comp Sci & Engn, Gyeonggi Do 17104, South Korea
[2] Kyung Hee Univ, Dept Artificial Intelligence, Gyeonggi Do 17104, South Korea
Keywords
DNN inference; block-based DNN network; trade-off between latency and accuracy; acceleration
DOI
10.1145/3654522.3654598
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recently, AI and deep neural networks have found extensive application in mobile devices, drones, carts, and more, creating a growing need to process large-scale data and provide DNN inference services with minimal latency. However, IoT devices, with their limited computing capabilities, are not well suited for AI inference, and the diverse requirements of different services call for inference that can adapt to these variations. To address these challenges, many previous studies have explored collaboration between edge servers and cloud servers by partitioning DNN models. Such methods, however, struggle to find optimal partitioning points and are heavily influenced by network bandwidth, since intermediate computation results must be transmitted between devices. In this paper, we propose an adaptive block-based DNN inference framework. A large DNN model is decomposed into block-level networks, each trained with knowledge distillation so that inference can be performed through an individual block network alone. At runtime, block-level inference computations are offloaded dynamically according to the computing capabilities of the edge cluster, and their results are combined to produce the final prediction. Even when multiple devices are used, the method is not affected by network bandwidth, because only the input images need to be transmitted. Experimental results demonstrate that our approach consistently reduces inference latency as the number of devices increases. Moreover, by controlling the trade-off between latency and accuracy, we can provide inference services tailored to various latency requirements.
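The abstract outlines three mechanisms: block-level networks distilled from a large teacher model, capability-aware offloading of blocks across edge devices, and a latency/accuracy trade-off governed by how many blocks participate. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the names BlockNet, distill_block, dispatch_blocks, parallel_inference, and the uniform per-block cost model are assumptions introduced here for illustration.

```python
# Minimal sketch (hypothetical, not the paper's code) of block-based distillation
# and capability-aware parallel inference across edge devices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockNet(nn.Module):
    """One independently usable block: a small feature extractor plus its own head."""
    def __init__(self, in_ch: int, out_ch: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(out_ch, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def distill_block(block, teacher, loader, epochs=1, T=4.0, lr=1e-3):
    """Train a block to match the frozen teacher's softened outputs (knowledge distillation)."""
    opt = torch.optim.Adam(block.parameters(), lr=lr)
    teacher.eval()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = block(x)
            loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                            F.softmax(t_logits / T, dim=1),
                            reduction="batchmean") * T * T
            opt.zero_grad()
            loss.backward()
            opt.step()

def dispatch_blocks(blocks, capabilities, latency_budget):
    """Select the blocks whose host devices can finish within the latency budget.
    Blocks run in parallel on separate devices, so overall latency is governed by
    the slowest selected device; using more blocks generally raises accuracy."""
    chosen = []
    for blk, cap in zip(blocks, capabilities):
        est_cost = 1.0 / cap  # hypothetical per-block latency estimate
        if est_cost <= latency_budget:
            chosen.append(blk)
    return chosen

def parallel_inference(x, chosen_blocks):
    """Each device receives only the input image; block logits are averaged."""
    with torch.no_grad():
        return torch.stack([blk(x) for blk in chosen_blocks]).mean(0)
```

Under these assumptions, accuracy improves as more block outputs are averaged, while latency stays bounded because the blocks execute in parallel and each device receives only the input image rather than intermediate feature maps.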
Pages: 505-511
Number of pages: 7