Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Cited by: 10
Authors
Barrachina, Sergio [1]
Castello, Adrian [2]
Dolz, Manuel F. [1]
Low, Tze Meng [3]
Martinez, Hector [4]
Quintana-Orti, Enrique S. [2]
Sridhar, Upasana [3]
Tomas, Andres E. [1]
Affiliations
[1] Univ Jaume 1, Castellon De La Plana, Spain
[2] Univ Politecn Valencia, Valencia, Spain
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[4] Univ Cordoba, Cordoba, Spain
Keywords
Convolution; Direct algorithm; Deep learning; High performance; ARMv8 architecture
DOI
10.1016/j.sysarc.2022.102806
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based processor. One of our methods presents the additional advantage of zero-memory overhead while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers.
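To make the contrast in the abstract concrete, the following is a minimal sketch of the two approaches it compares, not the authors' implementation. It assumes stride 1, no padding, flat Python lists, and a filter layout of (KH, KW, C, K); all function and parameter names are hypothetical. The direct version writes straight into the NHWC output with no workspace, while the lowering version first materializes an im2col matrix and then runs a plain matrix product. None of the blocking, packing, or vectorization that a high-performance kernel needs is shown.

```python
def direct_conv_nhwc(x, f, N, H, W, C, KH, KW, K):
    """Naive direct convolution (stride 1, no padding), zero workspace.

    x: input activations in NHWC order, length N*H*W*C
    f: filters, ASSUMED to be in (KH, KW, C, K) order, length KH*KW*C*K
    Returns the output activations in NHWC order, length N*HO*WO*K.
    """
    HO, WO = H - KH + 1, W - KW + 1
    y = [0.0] * (N * HO * WO * K)
    for n in range(N):
        for ho in range(HO):
            for wo in range(WO):
                out = ((n * HO + ho) * WO + wo) * K
                for kh in range(KH):
                    for kw in range(KW):
                        inp = ((n * H + ho + kh) * W + wo + kw) * C
                        flt = (kh * KW + kw) * C * K
                        for c in range(C):
                            xv = x[inp + c]
                            for k in range(K):
                                y[out + k] += xv * f[flt + c * K + k]
    return y


def im2col_gemm_nhwc(x, f, N, H, W, C, KH, KW, K):
    """Lowering approach: build the im2col matrix, then multiply.

    Returns (output, workspace_elements) so the extra memory is visible.
    """
    HO, WO = H - KH + 1, W - KW + 1
    rows, cols = N * HO * WO, KH * KW * C
    a = [0.0] * (rows * cols)  # explicit im2col workspace
    r = 0
    for n in range(N):
        for ho in range(HO):
            for wo in range(WO):
                col = 0
                for kh in range(KH):
                    for kw in range(KW):
                        inp = ((n * H + ho + kh) * W + wo + kw) * C
                        # Copy the C contiguous channels of one input pixel.
                        a[r * cols + col:r * cols + col + C] = x[inp:inp + C]
                        col += C
                r += 1
    # Plain triple-loop product: (rows x cols) times (cols x K).
    y = [0.0] * (rows * K)
    for i in range(rows):
        for p in range(cols):
            av = a[i * cols + p]
            for k in range(K):
                y[i * K + k] += av * f[p * K + k]
    return y, len(a)
```

In this sketch the im2col workspace holds N·HO·WO·KH·KW·C elements, which for a 3×3 filter approaches nine times the size of the input, whereas the direct loop nest allocates only the output. Both functions read and write activations in NHWC order, the layout the abstract says is preserved.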
Pages: 13