Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors

Cited by: 0

Authors:

Mannino, Mirco [1]

Peccerillo, Biagio [1]

Mondelli, Andrea [2]

Bartolini, Sandro [1]

Affiliations:

[1] Univ Siena, Dept Informat Engn & Math, I-53100 Siena, Italy

[2] Huawei Technol Co Ltd, Cambridge CB4 0WG, England

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation

DOI:

10.1109/ACCESS.2023.3283312

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Nowadays, convolutional neural networks are among the most widely used types of deep learning networks thanks to their usefulness in many application domains. There are many efforts to find methods to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this kind of approach uses extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over tensor dimensions, and it does not require extra memory. In this work, we evaluate on various multi-core CPUs the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67x speedup), matrix-matrix multiplication-based convolution in a multi-core system.
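The contrast the abstract draws between direct convolution and the flatten-then-multiply approach can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `direct_conv2d` and `im2col_conv2d`, the `[C_in][H][W]` and `[C_out][C_in][KH][KW]` layouts, the loop order, and the absence of padding. It is plain single-threaded Python with no parallelization or SIMD, so it shows only the structure of the two approaches, not their performance.

```python
def _out_dims(inp, weights, stride):
    """Return (c_in, c_out, kh, kw, h_out, w_out) for an unpadded convolution."""
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, kh, kw = len(weights), len(weights[0][0]), len(weights[0][0][0])
    return c_in, c_out, kh, kw, (h - kh) // stride + 1, (w - kw) // stride + 1


def direct_conv2d(inp, weights, stride=1):
    """Direct convolution: nested loops over the tensor dimensions.

    Only the output tensor is allocated; no intermediate buffer is built.
    (Hypothetical helper; loop order chosen for readability only.)
    """
    c_in, c_out, kh, kw, h_out, w_out = _out_dims(inp, weights, stride)
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for co in range(c_out):                 # output channels
        for y in range(h_out):              # output rows
            for x in range(w_out):          # output columns
                acc = 0.0
                for ci in range(c_in):      # input channels
                    for ky in range(kh):    # kernel rows
                        for kx in range(kw):  # kernel columns
                            acc += (inp[ci][y * stride + ky][x * stride + kx]
                                    * weights[co][ci][ky][kx])
                out[co][y][x] = acc
    return out


def im2col_conv2d(inp, weights, stride=1):
    """Same result via flattening into 2D matrices plus a matrix product.

    `cols` is the extra buffer: one row of c_in*kh*kw values per output pixel.
    """
    c_in, c_out, kh, kw, h_out, w_out = _out_dims(inp, weights, stride)
    cols = [[inp[ci][y * stride + ky][x * stride + kx]
             for ci in range(c_in) for ky in range(kh) for kx in range(kw)]
            for y in range(h_out) for x in range(w_out)]
    flat_w = [[weights[co][ci][ky][kx]
               for ci in range(c_in) for ky in range(kh) for kx in range(kw)]
              for co in range(c_out)]
    # Matrix-matrix product: (c_out x K) times (K x h_out*w_out).
    out_flat = [[sum(a * b for a, b in zip(w_row, col)) for col in cols]
                for w_row in flat_w]
    # Reshape each flat output row back to h_out x w_out.
    return [[row[y * w_out:(y + 1) * w_out] for y in range(h_out)]
            for row in out_flat]


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
    kernel = [[[[1, 0], [0, 1]]]]                 # 1 filter, 1 channel, 2x2
    print(direct_conv2d(image, kernel))
    print(im2col_conv2d(image, kernel))
```

In this sketch the `cols` matrix holds `kh * kw` copies of most input values, which is the extra memory the abstract refers to; the direct version reads the input in place.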

References

Pages: 57514-57528

Page count: 15

30 entries in total

[11] Intel, 2015, MATH KERN LIB
[12] Jeong Hwancheol, 2012, Performance of SSE and AVX instruction sets
[13] Jia Yangqing, 2014, arXiv
[14] Kalchbrenner N, 2014, Arxiv, DOI [arXiv:1404.2188, DOI 10.3115/V1/P14-1062]
[15] Kelefouras, Vasilios; Keramidas, Georgios. Design and Implementation of 2D Convolution on x86/x64 Processors [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (12): 3800-3815
[16] Khan, Asifullah; Sohail, Anabia; Zahoora, Umme; Qureshi, Aqsa Saeed. A survey of the recent architectures of deep convolutional neural networks [J]. ARTIFICIAL INTELLIGENCE REVIEW, 2020, 53 (08): 5455-5516
[17] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks [J]. COMMUNICATIONS OF THE ACM, 2017, 60 (06): 84-90
[18] Kukunas Jim, 2015, Power and performance: Software analysis and optimization
[19] Lan, Kun; Wang, Dan-tong; Fong, Simon; Liu, Lian-sheng; Wong, Kelvin K. L.; Dey, Nilanjan. A Survey of Data Mining and Deep Learning in Bioinformatics [J]. JOURNAL OF MEDICAL SYSTEMS, 2018, 42 (08)
[20] Li R., 2021, P 26 ACM INT C ARCH
