Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding

Cited by: 2
Authors
Wang, Lipeng [1 ]
Luo, Qiong [1 ]
Yan, Shengen [2 ]
Affiliations
[1] HKUST, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] SenseTime Res, Shenzhen, Peoples R China
Source
2020 IEEE 26TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS) | 2020
Keywords
deep learning; image decoding; parallel processing; heterogeneous processing; GPU;
DOI
10.1109/ICPADS51040.2020.00045
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812
Abstract
In computer vision deep learning (DL) tasks, most input image datasets are stored in the JPEG format and must be decoded before DL tasks are performed on them. We observe two problems in current JPEG decoding procedures for DL tasks: (1) the decoding of image entropy data is performed sequentially, and this sequential decoding is repeated across DL iterations, which takes significant time; (2) current parallel decoding methods under-utilize the massive number of hardware threads on GPUs. To reduce image decoding time, we introduce a pre-scan mechanism that avoids repeated image scanning in DL tasks. Our pre-scan generates boundary markers for the entropy data so that decoding can be performed in parallel. To cooperate with existing dataset storage and caching systems, we propose two modes of the pre-scan mechanism: a compatible mode and a fast mode. The compatible mode does not change the image file structure, so pre-scanned files can be stored back to disk for subsequent DL tasks. In comparison, the fast mode converts a JPEG image into a binary format suitable for parallel decoding, which can be processed directly on the GPU. Since the GPU has thousands of hardware threads, we propose a fine-grained parallel decoding method on the pre-scanned dataset. The fine-grained parallelism utilizes the GPU effectively and achieves speedups of around 1.5x over existing GPU-assisted image decoding libraries on real-world DL tasks.
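The abstract describes the pre-scan and parallel-decoding idea only at a high level. The following is a minimal Python sketch of the two-phase structure, under two simplifying assumptions that are not taken from the paper: JPEG restart markers (RST0-RST7) are treated as the segment boundaries, and a hypothetical decode_entropy_segment() stands in for per-segment Huffman decoding. The paper's actual pre-scan generates its own boundary markers and performs fine-grained decoding on the GPU; this CPU-side sketch only illustrates how a one-time sequential scan enables an embarrassingly parallel decode afterwards.

    from concurrent.futures import ThreadPoolExecutor

    def pre_scan(entropy_data: bytes) -> list[int]:
        # One-time sequential scan: record the byte offset immediately after every
        # JPEG restart marker (0xFFD0 .. 0xFFD7). These offsets play the role of the
        # "boundary markers" that let later decoding start at independent positions.
        boundaries = [0]
        i = 0
        while i < len(entropy_data) - 1:
            if entropy_data[i] == 0xFF and 0xD0 <= entropy_data[i + 1] <= 0xD7:
                boundaries.append(i + 2)  # segment begins right after the 2-byte marker
                i += 2
            else:
                i += 1
        return boundaries

    def decode_entropy_segment(segment: bytes):
        # Hypothetical placeholder: real per-segment Huffman/entropy decoding would go here.
        return segment

    def parallel_decode(entropy_data: bytes) -> list:
        # Once the boundaries are known, every segment can be decoded independently.
        # On a GPU, each segment (or a finer-grained unit) would map to a hardware
        # thread; a CPU thread pool is used here only to show the structure.
        boundaries = pre_scan(entropy_data)
        ends = boundaries[1:] + [len(entropy_data)]
        segments = [entropy_data[s:e] for s, e in zip(boundaries, ends)]
        with ThreadPoolExecutor() as pool:
            return list(pool.map(decode_entropy_segment, segments))

In this sketch the pre-scan result (the boundary list) is what would be cached or written back to disk, so the sequential pass is paid once rather than on every DL epoch.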
Pages: 274 - 281
Number of pages: 8