Efficient Video Captioning on Heterogeneous System Architectures

Cited by: 3
Authors
Huang, Horng-Ruey [1 ]
Hong, Ding-Yong [1 ]
Wu, Jan-Jan [1 ]
Liu, Pangfeng [2 ]
Hsu, Wei-Chung [2 ]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
[2] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei, Taiwan
Source
2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS) | 2021
Keywords
Video captioning; heterogeneous system architectures; model scheduling; dynamic programming; pipelining;
DOI
10.1109/IPDPS49936.2021.00112
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology];
Discipline Code
0812
Abstract
Video captioning is the core technology driving the development of many important multidisciplinary applications, such as AI-assisted medical diagnosis, storytelling through videos, video question answering, and lip-reading, just to name a few. Video captioning employs a hybrid CNN+RNN neural network model to translate video scenes into natural language descriptions. For deep learning inference, a typical approach is to run both the CNN and the RNN on a GPU. Such a GPU-only approach often suffers from long inference time because it underutilizes the computing power offered by the CPU+GPU heterogeneous system architecture, which is common in modern computers. This work is an early effort to tackle the performance issue of performing deep learning inference with a hybrid CNN+RNN model on a heterogeneous system with a CPU and a GPU. The task is challenging for two reasons: (1) the CNN and the RNN exhibit very different computing behaviors, which raises the question of how to split the two models into computing tasks and properly assign those tasks to the CPU and the GPU to minimize the inference time for a video frame; and (2) data dependencies exist between the CNN and the RNN within a video frame, as well as between the RNNs of adjacent video frames, which prohibit full parallelization of the hybrid model. To solve these two problems, we propose two optimizations: a fine-grained scheduling scheme that maps computations to devices within a video frame, and a pipeline scheduling scheme that exploits maximum parallelism across the execution of the video frames. To facilitate these optimizations, we also develop an accurate regression-based cost model to predict the computation time of CNN/RNN operations and the communication time for moving data between the CPU and the GPU. Experimental results show that our optimizations improve the performance of video captioning by up to 3.24x on the CPU+GPU system, compared with GPU-only execution.
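As an illustration of the regression-based cost model described in the abstract, the sketch below fits a linear model that predicts an operation's execution time from its parameters. The feature set (input size and FLOPs), the profiling numbers, and the linear form are assumptions made here for illustration; the paper states only that the cost model is regression-based.

```python
import numpy as np

# Hypothetical profiling data: each row is (input_size, FLOPs) for one
# CNN/RNN operation, paired with a measured execution time in milliseconds.
# These numbers are illustrative, not measurements from the paper.
features = np.array([
    [224 * 224 * 3, 1.2e9],
    [112 * 112 * 64, 3.4e9],
    [56 * 56 * 128, 2.9e9],
    [28 * 28 * 256, 1.6e9],
])
measured_ms = np.array([4.1, 9.8, 8.2, 5.0])

# Fit y ~= X @ w by ordinary least squares, with a bias column appended.
X = np.hstack([features, np.ones((len(features), 1))])
w, *_ = np.linalg.lstsq(X, measured_ms, rcond=None)

def predict_ms(input_size, flops):
    """Predict the execution time of an operation from its parameters."""
    return float(np.array([input_size, flops, 1.0]) @ w)

# A communication-time model can be fit the same way, from the number of
# bytes moved between CPU and GPU memory per transfer.
```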
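The fine-grained scheduling scheme assigns computing tasks to the CPU or the GPU while accounting for CPU-GPU data transfer costs, and the paper's keywords name dynamic programming as the tool. The following is a minimal sketch of that idea under two assumptions: the tasks form a linear chain of layers, and the cost tables are illustrative numbers rather than profiled values. The paper's actual formulation may differ.

```python
# compute_ms[i][d]: hypothetical time of layer i on device d (milliseconds).
compute_ms = [
    {"cpu": 5.0, "gpu": 1.0},   # conv layer: much faster on the GPU
    {"cpu": 4.0, "gpu": 0.8},
    {"cpu": 1.5, "gpu": 2.5},   # LSTM step: sequential, CPU-friendly
]
transfer_ms = 0.7               # cost of moving an activation over the bus

def schedule(compute_ms, transfer_ms):
    devices = ("cpu", "gpu")
    # best[d]: minimum time to finish the prefix with the current layer on d.
    best = {d: compute_ms[0][d] for d in devices}
    choice = [{d: None for d in devices}]
    for layer in compute_ms[1:]:
        new_best, new_choice = {}, {}
        for d in devices:
            # Pay a transfer penalty whenever consecutive layers differ.
            cand = {p: best[p] + (transfer_ms if p != d else 0.0) + layer[d]
                    for p in devices}
            prev = min(cand, key=cand.get)
            new_best[d], new_choice[d] = cand[prev], prev
        best = new_best
        choice.append(new_choice)
    # Backtrack the optimal device assignment.
    d = min(best, key=best.get)
    total, plan = best[d], [d]
    for step in reversed(choice[1:]):
        d = step[d]
        plan.append(d)
    return total, plan[::-1]

total, plan = schedule(compute_ms, transfer_ms)
print(total, plan)  # 4.0 ['gpu', 'gpu', 'cpu'] with these numbers
```

The recurrence keeps, per device, the best finishing time of the layer prefix, charging the transfer cost only when adjacent layers land on different devices; the cheap LSTM step is pushed to the CPU despite the transfer penalty.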
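The pipeline scheduling scheme exploits the dependency structure noted in the abstract: the RNN of frame t depends on the RNN of frame t-1, so the RNN stage must stay sequential, but the CNN of frame t+1 is free to overlap the RNN of frame t. A minimal sketch follows, with cnn() and rnn() as hypothetical stand-ins for the real model stages.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def cnn(frame):
    time.sleep(0.02)            # stand-in for CNN feature extraction
    return f"features({frame})"

def rnn(features, state):
    time.sleep(0.01)            # stand-in for one RNN decoding step
    return f"caption({features})", state + 1

def caption_video(frames):
    state, captions = 0, []
    with ThreadPoolExecutor(max_workers=1) as cnn_worker:
        pending = cnn_worker.submit(cnn, frames[0])
        for nxt in frames[1:]:
            features = pending.result()            # wait for frame t's CNN
            pending = cnn_worker.submit(cnn, nxt)  # launch frame t+1's CNN
            out, state = rnn(features, state)      # frame t's RNN overlaps it
            captions.append(out)
        out, state = rnn(pending.result(), state)  # drain the last frame
        captions.append(out)
    return captions

print(caption_video([f"frame{i}" for i in range(4)]))
```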
Pages: 1035-1045
Page count: 11