High Performance and Power Efficient Accelerator for Cloud Inference

Cited by: 3
|
Authors
Yao, Jianguo [1 ,2 ]
Zhou, Hao [2 ]
Zhang, Yalin [2 ]
Li, Ying [2 ]
Feng, Chuang [2 ]
Chen, Shi [2 ]
Chen, Jiaoyan [2 ]
Wang, Yongdong [2 ]
Hu, Qiaojuan [2 ]
Affiliations
[1] SJTU, Shanghai, Peoples R China
[2] Enflame Tech Inc, Shanghai, Peoples R China
Source
2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA | 2023
Keywords
REGISTER FILE; ARCHITECTURE; TIME;
DOI
10.1109/HPCA56546.2023.10070941
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Facing the growing complexity of Deep Neural Networks (DNNs), high-performance and power-efficient AI accelerators are desired to provide effective and affordable cloud inference services. We introduce our flagship product, the Cloudblazer i20 accelerator, which integrates the novel Deep Thinking Unit (DTU 2.0). The design is driven by requirements drawn from various AI inference applications and insights learned from our previous products. With careful tradeoffs in hardware-software co-design, Cloudblazer i20 delivers impressive performance and energy efficiency while maintaining acceptable hardware costs and software complexity/flexibility. To tackle computation- and data-intensive workloads, DTU 2.0 integrates powerful vector/matrix engines and a large-capacity, multi-level memory hierarchy with high bandwidth. It supports comprehensive data flow and synchronization patterns to fully exploit parallelism in computation and memory access within or among concurrent tasks. Moreover, it enables sparse data compression/decompression, data broadcasting, repeated data transfer, and kernel code prefetching to optimize bandwidth utilization and reduce data access overheads. To utilize the underlying hardware and simplify the development of customized DNNs/operators, the software stack enables automatic optimizations (such as operator fusion and data flow tuning) and provides diverse programming interfaces for developers. Lastly, energy consumption is optimized through dynamic power integrity and efficiency management, eliminating integrity risks and energy waste. Depending on performance requirements, developers can also assign their workloads to all or part of the hardware resources. Evaluated with 10 representative DNN models widely adopted in various domains, Cloudblazer i20 outperforms Nvidia T4 and A10 GPUs by geometric means of 2.22x and 1.16x in performance and 1.04x and 1.17x in energy efficiency, respectively.
The improvements demonstrate the effectiveness of Cloudblazer i20's design, which emphasizes performance, efficiency, and flexibility.
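The operator fusion mentioned for the software stack can be illustrated with a toy sketch. The function names and operator choice (scale, bias, ReLU) are hypothetical and are not Enflame's API; the point is that a fused kernel traverses the data once instead of materializing an intermediate array per operator:

```python
def unfused_scale_bias_relu(xs, scale, bias):
    """Three separate elementwise passes: each step writes a full
    intermediate list back to memory before the next step reads it."""
    t1 = [x * scale for x in xs]        # intermediate 1
    t2 = [t + bias for t in t1]         # intermediate 2
    return [max(t, 0.0) for t in t2]    # final output

def fused_scale_bias_relu(xs, scale, bias):
    """One fused pass: multiply, add, and clamp each element in a
    single loop, with no intermediate arrays written to memory."""
    return [max(x * scale + bias, 0.0) for x in xs]
```

Both functions compute the same result; the fused form models what a fusing compiler emits, trading N intermediate stores/loads per operator for a single traversal.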
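The cross-model summary numbers (e.g., 2.22x over T4) are geometric means over per-model speedup ratios. A minimal helper, using illustrative ratios that are not the paper's measured data:

```python
import math

def geomean_speedup(ratios):
    """Geometric mean of per-model speedup ratios: exp(mean(log r)).
    Unlike the arithmetic mean, one outlier model cannot dominate."""
    assert ratios, "need at least one measurement"
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-model speedups for illustration only:
print(geomean_speedup([2.0, 1.5, 3.0]))  # cube root of 9.0
```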
Pages: 1003-1016
Page count: 14