Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Cited by: 9
Authors
Barrachina, Sergio [1]
Castello, Adrian [2]
Dolz, Manuel F. [1]
Low, Tze Meng [3]
Martinez, Hector [4]
Quintana-Orti, Enrique S. [2]
Sridhar, Upasana [3]
Tomas, Andres E. [1]
Affiliations
[1] Univ Jaume I, Castellon de la Plana, Spain
[2] Univ Politecn Valencia, Valencia, Spain
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[4] Univ Cordoba, Cordoba, Spain
Keywords
Convolution; Direct algorithm; Deep learning; High performance; ARMv8 architecture
DOI
10.1016/j.sysarc.2022.102806
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based processor. One of our methods presents the additional advantage of zero-memory overhead while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers.
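To make the contrast in the abstract concrete, the following is a minimal, unoptimized sketch of a direct convolution that reads and writes activations in NHWC order, with a helper that counts the elements an im2col workspace would need for the same layer. It is not the authors' implementation: the function names, the filter indexing `w[kh][kw][ci][co]`, and the absence of padding are assumptions made for illustration only.

```python
def conv2d_direct_nhwc(x, w, stride=1):
    """Naive direct convolution without padding (illustrative only).

    x: input activations indexed x[n][h][w][ci]   (NHWC)
    w: filters indexed w[kh][kw][ci][co]          (assumed layout)
    Returns y indexed y[n][ho][wo][co]            (NHWC)
    """
    n, hi, wi, ci = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    kh, kw, co = len(w), len(w[0]), len(w[0][0][0])
    ho = (hi - kh) // stride + 1
    wo = (wi - kw) // stride + 1
    # Only the output tensor is allocated; no intermediate workspace.
    y = [[[[0.0] * co for _ in range(wo)] for _ in range(ho)] for _ in range(n)]
    for b in range(n):
        for oh in range(ho):
            for ow in range(wo):
                out = y[b][oh][ow]
                for fh in range(kh):
                    for fw in range(kw):
                        pixel = x[b][oh * stride + fh][ow * stride + fw]
                        taps = w[fh][fw]
                        for c in range(ci):
                            xv = pixel[c]
                            row = taps[c]
                            for k in range(co):
                                out[k] += xv * row[k]
    return y


def im2col_workspace_elems(n, hi, wi, ci, kh, kw, stride=1):
    """Element count of the (n*ho*wo) x (kh*kw*ci) matrix built by im2col."""
    ho = (hi - kh) // stride + 1
    wo = (wi - kw) // stride + 1
    return (n * ho * wo) * (kh * kw * ci)


# Tiny example: one 3x3 single-channel image, one 2x2 filter of ones.
x = [[[[1.0], [2.0], [3.0]],
      [[4.0], [5.0], [6.0]],
      [[7.0], [8.0], [9.0]]]]
w = [[[[1.0]], [[1.0]]],
     [[[1.0]], [[1.0]]]]
y = conv2d_direct_nhwc(x, w)
workspace = im2col_workspace_elems(1, 3, 3, 1, 2, 2)
```

In this toy case the input holds 9 elements, whereas the im2col matrix would hold 16, because overlapping filter windows are replicated. The direct loop nest avoids that copy entirely, which is the memory argument the abstract makes; the performance claims depend on architecture-specific optimizations that this sketch does not attempt.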
Pages: 13