Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution

Cited by: 3
Authors
Sarraf, Saman [1 ,2 ]
Kabia, Milton [2 ]
Affiliations
[1] Institute of Electrical and Electronics Engineers, Santa Clara Valley Section, Santa Clara, CA 94085 USA
[2] National University, School of Technology & Engineering, San Diego, CA 92123 USA
Keywords
action recognition; vision transformer; cloud solution; inference; deep learning; attention
DOI
10.3390/make5040067
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This study introduces an optimal topology of vision transformers for real-time video action recognition in a cloud-based solution. Although model performance is a key criterion for real-time video analysis use cases, inference latency plays a more decisive role in adopting such technology in real-world scenarios. Our objective is to reduce the inference latency of the solution while keeping the vision transformer's performance within an acceptable margin. To that end, we built our machine learning pipeline on optimal cloud components and optimized the topology of the vision transformer. We utilized UCF101, a benchmark of 13,320 action recognition video clips spanning 101 classes, from which the preprocessing stage extracts more than one million frames. The modeling pipeline consists of a preprocessing module that extracts frames from video clips, the training of two-dimensional (2D) vision transformer models, and deep learning baselines for comparison. The pipeline also includes a postprocessing step that aggregates frame-level predictions into video-level predictions at inference. The results demonstrate that our optimal vision transformer, with an input dimension of 56 × 56 × 3 and eight attention heads, produces an F1 score of 91.497% on the testing set. The optimized vision transformer reduces inference latency by 40.70%, measured through a batch-processing approach, and trains 55.63% faster than the baseline. Lastly, we developed an enhanced skip-frame approach that finds an optimal ratio of frames to use for prediction at inference, further reducing inference latency by 57.15%. This study reveals that the vision transformer model is highly optimizable for inference latency while maintaining model performance.
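To make the postprocessing and skip-frame steps concrete, below is a minimal sketch of how frame-level scores could be aggregated into a video-level label under a tunable keep ratio. The function name `video_level_prediction`, the uniform-stride subsampling, and the majority-vote aggregation rule are illustrative assumptions; the abstract describes aggregation and a tuned frame ratio, but this is not the authors' published code.

```python
import numpy as np

def video_level_prediction(frame_scores: np.ndarray, keep_ratio: float = 0.5) -> int:
    """Sketch of skip-frame aggregation (assumed details, not the authors' code).

    frame_scores : (num_frames, num_classes) per-frame class scores produced
                   by a 2D vision transformer applied to extracted frames.
    keep_ratio   : fraction of frames scored at inference; the paper tunes
                   this ratio, and the default here is illustrative only.
    """
    # Uniform-stride subsampling approximates scoring only a ratio of frames.
    stride = max(1, int(round(1.0 / keep_ratio)))
    kept = frame_scores[::stride]
    # Frame-level class decisions for the retained frames.
    frame_preds = kept.argmax(axis=1)
    # Majority vote turns frame-level predictions into one video-level label.
    return int(np.bincount(frame_preds).argmax())

# Toy usage: 180 frames of random scores over the 101 UCF101 classes.
rng = np.random.default_rng(0)
scores = rng.random((180, 101))
print(video_level_prediction(scores, keep_ratio=0.5))
```

Because the transformer forward pass dominates per-frame cost, scoring only a subset of frames should cut inference latency roughly in proportion to the keep ratio, which is the intuition behind the additional 57.15% latency reduction the abstract reports for the tuned skip-frame approach.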
Pages: 1320-1339 (20 pages)