Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors

Cited by: 0
Authors
Mannino, Mirco [1]
Peccerillo, Biagio [1]
Mondelli, Andrea [2]
Bartolini, Sandro [1]
Affiliations
[1] Univ Siena, Dept Informat Engn & Math, I-53100 Siena, Italy
[2] Huawei Technol Co Ltd, Cambridge CB4 0WG, England
Keywords
Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation
DOI
10.1109/ACCESS.2023.3283312
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Nowadays, convolutional neural networks are among the most widely used types of deep learning networks thanks to their usefulness in many application domains. Many efforts aim to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this kind of approach uses extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over tensor dimensions, and it does not require extra memory. In this work, we evaluate on various multi-core CPUs the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67x speedup), matrix-matrix multiplication-based convolution in a multi-core system.
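To illustrate the abstract's statement that direct convolution is a series of nested loops over tensor dimensions, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `direct_conv2d`, the `[channels][height][width]` layout, the loop order, and the absence of padding are all assumptions made for illustration, and it includes none of the parallelization or SIMD strategies the paper evaluates.

```python
def direct_conv2d(inp, weights, stride=1):
    """Direct 2D convolution (cross-correlation, as is usual in CNNs).

    Assumed layouts (hypothetical, chosen for readability):
      inp:     [C_in][H][W]
      weights: [C_out][C_in][KH][KW]
    Returns:   [C_out][H_out][W_out], with no padding applied.
    """
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out = len(weights)
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1

    # The output is written in place; no flattened copy of the input is built.
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]

    for co in range(c_out):                  # output channels
        for oy in range(h_out):              # output rows
            for ox in range(w_out):          # output columns
                acc = 0.0
                for ci in range(c_in):       # input channels
                    for ky in range(kh):     # kernel rows
                        for kx in range(kw): # kernel columns
                            acc += (inp[ci][oy * stride + ky][ox * stride + kx]
                                    * weights[co][ci][ky][kx])
                out[co][oy][ox] = acc
    return out


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3x3 input channel
    kernel = [[[[1, 0], [0, 1]]]]                 # one 2x2 filter
    print(direct_conv2d(image, kernel))
```

The six loops can be reordered, tiled, or split across threads without changing the result, which is the kind of design space the abstract refers to as loop organizations and parallelization strategies.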
Pages: 57514-57528
Page count: 15