Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Cited by: 10

Authors:

Barrachina, Sergio [1]

Castello, Adrian [2]

Dolz, Manuel F. [1]

Low, Tze Meng [3]

Martinez, Hector [4]

Quintana-Orti, Enrique S. [2]

Sridhar, Upasana [3]

Tomas, Andres E. [1]

Affiliations:

[1] Univ Jaume I, Castellon De La Plana, Spain

[2] Univ Politecn Valencia, Valencia, Spain

[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[4] Univ Cordoba, Cordoba, Spain

Source:

JOURNAL OF SYSTEMS ARCHITECTURE | 2023 / Volume 135

Keywords:

Convolution; Direct algorithm; Deep learning; High performance; ARMv8 architecture

DOI:

10.1016/j.sysarc.2022.102806

Chinese Library Classification number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based processor. One of our methods presents the additional advantage of zero-memory overhead while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers.
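To make the contrast in the abstract concrete, the following is a minimal sketch of a direct convolution that reads and writes NHWC activations without any extra workspace, alongside a helper that counts the elements an im2col buffer would need for the same layer. It is not the authors' implementation, which targets ARMv8 with optimized kernels. The function names, the (KH, KW, CI, CO) filter ordering, and the restriction to stride 1 with no padding are all assumptions made for illustration.

```python
def direct_conv_nhwc(x, w, n, h, wd, ci, kh, kw, co):
    """Naive direct convolution (stride 1, no padding), illustrative only.

    x: flat list in NHWC order, length n*h*wd*ci
    w: flat list in (KH, KW, CI, CO) order (an assumed filter layout)
    Returns a flat list in NHWC order with shape (n, ho, wo, co).
    """
    ho, wo = h - kh + 1, wd - kw + 1
    y = [0.0] * (n * ho * wo * co)
    for b in range(n):
        for oh in range(ho):
            for ow in range(wo):
                y_base = ((b * ho + oh) * wo + ow) * co
                for fh in range(kh):
                    for fw in range(kw):
                        # Input pixel touched by this filter tap; channels are contiguous.
                        x_base = ((b * h + oh + fh) * wd + ow + fw) * ci
                        w_base = (fh * kw + fw) * ci * co
                        for c in range(ci):
                            xv = x[x_base + c]
                            w_row = w_base + c * co
                            for k in range(co):
                                # Accumulate straight into the output: no temporary buffer.
                                y[y_base + k] += xv * w[w_row + k]
    return y


def im2col_workspace_elems(n, h, wd, ci, kh, kw):
    """Element count of the matrix that im2col would materialize for the same layer."""
    ho, wo = h - kh + 1, wd - kw + 1
    return n * ho * wo * kh * kw * ci


# Usage: one 3x3 single-channel image, one 2x2 all-ones filter.
x = [float(v) for v in range(1, 10)]
w = [1.0] * 4
y = direct_conv_nhwc(x, w, n=1, h=3, wd=3, ci=1, kh=2, kw=2, co=1)
workspace = im2col_workspace_elems(n=1, h=3, wd=3, ci=1, kh=2, kw=2)
```

In this toy case the im2col buffer would hold 16 elements for a 9-element input. The direct loop nest allocates only the output, which is the zero-memory-overhead property the abstract refers to.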


Pages: 13
