Assembly language and assembler for deep learning accelerators

Cited by: 0
Authors
Lan H. [1 ,2 ]
Wu L. [1 ,2 ]
Han D. [1 ,2 ]
Du Z. [1 ,3 ]
Affiliations
[1] Intelligent Processor Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
[2] University of Chinese Academy of Sciences, Beijing
[3] Cambricon Technologies Corporation Limited, Beijing
Keywords
Assembly language; Deep learning; Deep learning accelerator (DLA); Programming language;
DOI
10.3772/j.issn.1006-6748.2019.04.006
Abstract
Deep learning accelerators (DLAs) have proved to be efficient computational devices for processing deep learning algorithms. Various DLA architectures have been proposed and applied to different applications and tasks. However, for most DLAs, the programming interfaces are either difficult to use or not efficient enough. Most DLAs require programmers to write instructions directly, which is time-consuming and error-prone. Another prevailing programming interface for DLAs consists of high-performance libraries and deep learning frameworks, which are easy to use and friendly to users, but their high abstraction level limits their control over hardware resources and thus compromises the efficiency of the accelerator. This paper presents a design of the programming interface for DLAs. First, various existing DLAs and their programming methods are analyzed, and a methodology for designing a programming interface for DLAs is proposed, comprising a high-level assembly language (called DLA-AL), an assembler and a runtime for DLAs. DLA-AL is composed of a low-level assembly language and a set of high-level blocks. It allows experienced experts to fully exploit the potential of DLAs and achieve near-optimal performance. Meanwhile, by using DLA-AL, end-users who have little knowledge of the hardware are able to develop deep learning algorithms on DLAs with minimal programming effort. Copyright © by HIGH TECHNOLOGY LETTERS PRESS.
Pages: 386-394
Page count: 8
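The record itself contains no code, but the two-level interface described in the abstract (a low-level assembly language wrapped by high-level blocks) can be sketched roughly as follows. Everything in this sketch is a hypothetical illustration under assumed names: the Assembler class, the conv_block helper, the LOAD/CONV/STORE mnemonics and the sram buffer names are invented here and are not DLA-AL's actual syntax or API.

# A minimal, hypothetical sketch (not DLA-AL's actual syntax): a high-level
# block that expands into low-level, instruction-like statements collected
# by a toy assembler, illustrating the two-level design named in the abstract.

class Assembler:
    """Collects low-level, DLA-style instructions (invented mnemonics)."""
    def __init__(self):
        self.instructions = []

    def emit(self, mnemonic, *operands):
        # Experts could call emit() directly for fine-grained control.
        self.instructions.append((mnemonic, operands))

    def program(self):
        # Render the collected instructions as assembly-like text.
        return "\n".join(
            mnemonic + " " + ", ".join(str(op) for op in operands)
            for mnemonic, operands in self.instructions
        )


def conv_block(asm, in_addr, w_addr, out_addr, rows, cols):
    """Hypothetical high-level block: expands a 3x3 convolution into
    load / compute / store steps an end-user never writes by hand."""
    asm.emit("LOAD", "sram0", hex(in_addr), rows * cols)    # inputs on-chip
    asm.emit("LOAD", "sram1", hex(w_addr), 3 * 3)           # kernel weights
    asm.emit("CONV", "sram2", "sram0", "sram1", rows, cols)
    asm.emit("STORE", hex(out_addr), "sram2", rows * cols)  # results off-chip


if __name__ == "__main__":
    asm = Assembler()
    conv_block(asm, in_addr=0x0000, w_addr=0x4000, out_addr=0x8000,
               rows=28, cols=28)
    print(asm.program())

In this sketch, an experienced user can call emit() directly to control instruction selection and on-chip buffer placement, while an end-user only composes high-level blocks such as conv_block, mirroring the division of labor the abstract describes.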