Im2col-Winograd: An Efficient and Flexible Fused-Winograd Convolution for NHWC Format on GPUs

Citations: 0
Authors
Zhang, Zhiyi [1 ]
Zhang, Pengfei [2 ]
Xu, Zhuopin [2 ]
Yan, Bingjie [3 ]
Wang, Qi [2 ]
Affiliations
[1] Chinese Acad Sci, Hefei Inst Phys Sci, Univ Sci & Technol China, Hefei, Peoples R China
[2] Chinese Acad Sci, Hefei Inst Phys Sci, Hefei, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
Source
53RD INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2024 | 2024
Funding
National Natural Science Foundation of China
Keywords
Winograd; Convolutional Neural Networks; GPU; Performance;
DOI
10.1145/3673038.3673039
CLC Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Compared to standard convolution, the Winograd algorithm has lower time complexity and can accelerate the execution of convolutional neural networks. Previous studies have used Winograd to implement 2D convolution on GPUs, mostly relying on 2D Winograd and arranging tensors in NCHW or CHWN format rather than NHWC so that data accesses are coalesced. Because of Winograd's higher space complexity and limited hardware resources, these implementations are usually confined to small filters. To provide an efficient and flexible fused-Winograd convolution for the NHWC format on GPUs, we propose Im2col-Winograd. This algorithm decomposes an ND convolution into a series of 1D convolutions so that 1D Winograd can be applied, thereby reducing both space complexity and data-access discontinuity. The reduced space complexity makes Im2col-Winograd less constrained by hardware capability, allowing it to accommodate a wider range of filter shapes. Our implementations support filter widths from 2 to 9 and do not use any workspace to store intermediate variables. In our experiments, Im2col-Winograd achieves a speedup of 0.788x to 2.05x over the fastest benchmark algorithm in cuDNN, and shows convergence similar to PyTorch on the Cifar10 and ILSVRC2012 datasets. Along with its memory efficiency, the more general acceleration offered by Im2col-Winograd can benefit feature extraction at different convolution scales.
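To make the abstract's core idea concrete, the sketch below illustrates how a 2D convolution can be decomposed into row-wise 1D convolutions, each computed with the 1D Winograd F(2,3) transform. This is only an illustrative reading of the decomposition described in the abstract, not the authors' fused NHWC GPU kernel; the function names winograd_1d_f23 and conv2d_rows_as_1d, and the restriction to 3-tap filter rows, are assumptions made for brevity.

```python
import numpy as np

# Standard 1D Winograd F(2,3) transform matrices (Lavin & Gray formulation).
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_1d_f23(d, g):
    """Two outputs of a 1D correlation of a 4-element tile d with a
    3-tap filter g via Winograd F(2,3): 4 multiplications instead of 6."""
    return AT @ ((G @ g) * (BT @ d))

def conv2d_rows_as_1d(x, w):
    """Illustrative only: a 2D valid correlation written as a sum of
    1D row convolutions, each accelerated with 1D Winograd F(2,3).
    x: (H, W) input, w: (R, 3) filter.  This mirrors the idea of
    decomposing an ND convolution into 1D convolutions; it is not the
    paper's fused NHWC GPU implementation."""
    H, W = x.shape
    R, S = w.shape
    assert S == 3, "sketch is fixed to 3-tap filter rows (F(2,3))"
    out = np.zeros((H - R + 1, W - S + 1), dtype=np.float32)
    for r in range(R):                                # accumulate one filter row at a time
        for i in range(out.shape[0]):
            row = x[i + r]
            for j in range(0, out.shape[1] - 1, 2):   # 2 output columns per Winograd tile
                out[i, j:j + 2] += winograd_1d_f23(row[j:j + 4], w[r])
            if out.shape[1] % 2:                      # leftover column computed directly
                out[i, -1] += row[-3:] @ w[r]
    return out

if __name__ == "__main__":
    # Check the decomposition against a direct sliding-window correlation.
    x = np.random.rand(6, 8).astype(np.float32)
    w = np.random.rand(3, 3).astype(np.float32)
    ref = np.array([[(x[i:i + 3, j:j + 3] * w).sum() for j in range(6)]
                    for i in range(4)])
    assert np.allclose(conv2d_rows_as_1d(x, w), ref, atol=1e-4)
```

With F(2,3), each pair of output columns costs 4 multiplications instead of 6, which is where the arithmetic saving of the 1D Winograd step comes from; the row-wise accumulation keeps the working set small compared to a full 2D Winograd tile.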
Pages: 1072-1081
Page count: 10