Efficient Distributed Mapping-Based Computation for Convolutional Neural Networks in Multi-Core Embedded Parallel Environment

Cited by: 0
Authors
Jia, Long [1]
Li, Gang [2]
Lu, Meili [3]
Wei, Xile [4]
Yi, Guosheng [4]
Affiliations
[1] Beihang Univ, Sch Automat Sci & Elect Engn, Beijing 100191, Peoples R China
[2] Beijing Aerosp Automat Control Inst, Beijing 100854, Peoples R China
[3] Tianjin Univ Technol & Educ, Sch Informat Technol Engn, Tianjin 300222, Peoples R China
[4] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Proc Measurement & Control, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
edge computing; convolutional neural network; parallel computing; embedded platform; distributed mapping; hardware
DOI
10.3390/electronics12183747
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Embedded systems are the best solution for high-performance computing tasks at edge terminals. With the rapid increase in the amount of data generated by edge devices, it is imperative to implement data- and computation-intensive intelligent algorithms on embedded terminal systems. In this paper, a novel multi-core ARM-based embedded hardware platform with a three-dimensional mesh structure is first established to support decentralized algorithms. To deploy deep convolutional neural networks (CNNs) in this embedded parallel environment, a distributed mapping mechanism is proposed that decentralizes computation tasks efficiently in the form of a multi-branch assembly line. In addition, a dimensionality reduction initialization method is used to resolve the conflict between the storage requirements of the computation tasks and the limited physical memory. LeNet-5 networks of different sizes were optimized and implemented on the embedded platform to verify the performance of the proposed strategies. The results show that dimensionality reduction keeps memory usage within the usable range. Using the down-sampling layer as the base point of the mapping for inter-layer segmentation achieved the best performance in lateral dispersion, reducing the running time by around 10% compared with using the other layers as the base point. Further, for a network with an input size of 105 × 105, computation in the multi-core parallel environment is nearly 20 times faster than in a single-core system. This paper provides a feasible strategy for edge deployment of artificial intelligence algorithms on multi-core embedded devices.
Pages: 18