STVAI: Exploring spatio-temporal similarity for scalable and efficient intelligent video inference

Cited: 0
Authors
Li, Chuang [1 ,2 ]
Wang, Heshi [1 ]
Wen, Yanhua [1 ,2 ]
Shi, Qingyu [1 ,2 ]
Wang, Qinyu [3 ]
Hu, Chunhua [1 ,2 ]
Wu, Dongchen [4 ]
Affiliations
[1] Hunan Univ Technol & Business, Coll Comp Sci, Changsha 410205, Hunan, Peoples R China
[2] Xiangjiang Lab, Changsha 410205, Hunan, Peoples R China
[3] South China Univ Technol, Sch Future Technol, Guangzhou 510641, Guangdong, Peoples R China
[4] York Univ, Schulich Sch Business, Toronto, ON M3J 1P3, Canada
Keywords
Convolutional neural network; Deep learning; CUDA programming; Video inference; Parallel computing;
DOI
10.1016/j.jpdc.2025.105079
CLC classification
TP301 [Theory and Methods];
Discipline code
081202;
Abstract
The integration of video data computation and inference is a cornerstone for the evolution of multimodal artificial intelligence (MAI). The extensive adoption and optimization of CNN-based frameworks has significantly improved the accuracy of video inference, yet these frameworks face substantial challenges under real-time and large-scale computational demands. Existing research primarily exploits the temporal similarity between video frames to reduce redundant computation, but most approaches overlook the spatial similarity within the frames themselves. Hence, we propose STVAI, a scalable and efficient method that leverages both spatial and temporal similarities to accelerate video inference. The approach uses a parallel region-merging strategy, which maintains inference accuracy while increasing the sparsity of the computation matrix. Moreover, we optimize sparse convolutions using Tensor Cores, which accelerate dense convolution computations based on the sparsity of the tiles. Experimental results demonstrate that STVAI achieves a stable 1.25x speedup over cuDNN implementations with only a 5% decrease in prediction accuracy, and reaches speedups of up to 1.53x, surpassing existing methods. Our method can be applied directly to various CNN architectures for video inference tasks without retraining the model.
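The abstract describes skipping redundant computation by exploiting frame-to-frame (temporal) similarity and measuring sparsity at tile granularity. The paper's actual CUDA/Tensor Core implementation is not reproduced here; the following is only a minimal NumPy sketch of the general idea, assuming a per-pixel change threshold and a per-tile activity threshold (both values are illustrative, not taken from the paper):

```python
import numpy as np

def tile_sparsity_mask(prev_frame, curr_frame, tile=8, pixel_thresh=10.0, tile_thresh=0.05):
    """Per-tile update mask from frame-to-frame differences.

    A tile is marked "active" (needs recomputation) when the fraction of
    significantly changed pixels inside it exceeds `tile_thresh`; inactive
    tiles can reuse the previous frame's cached activations. Thresholds
    here are illustrative assumptions, not values from the paper.
    """
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    changed = diff > pixel_thresh            # per-pixel change test
    h, w = changed.shape
    th, tw = h // tile, w // tile
    # Fraction of changed pixels in each tile x tile block.
    frac = changed[:th * tile, :tw * tile].reshape(th, tile, tw, tile).mean(axis=(1, 3))
    return frac > tile_thresh                # True = recompute this tile

# Toy example: a static background with a change confined to one tile.
prev = np.zeros((32, 32), dtype=np.uint8)
curr = prev.copy()
curr[0:8, 0:8] = 255                         # only the top-left tile changes
mask = tile_sparsity_mask(prev, curr)        # 4x4 tile grid; exactly 1 tile active
```

In a scheme like the one the abstract outlines, tiles dense enough to pass such a test would be routed to dense Tensor Core convolution kernels, while the rest are skipped, which is where the reported speedup over a full cuDNN pass would come from.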
Pages: 10