Research on Video Retrieval Technology based on Multimodal Fusion and Attention Mechanism

Cited by: 0

Authors:

Tai, Tianyang [1]

Zeng, Fanfeng [1]

Affiliations:

[1] North China Univ Technol, Coll Informat, Beijing, Peoples R China

Source:

PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON ELECTRONIC INFORMATION TECHNOLOGY AND COMPUTER ENGINEERING, EITCE 2023 | 2023

Keywords:

Multimodal fusion; Video retrieval; Attention mechanism

DOI:

10.1145/3650400.3650477

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Feature extraction and matching are crucial in video retrieval tasks. However, existing algorithms often overlook motion features in action-related videos and focus only on global static features. Distinguishing key action features from background features is also difficult, which hinders the capture of global dependencies during convolution. The result is less expressive features and reduced retrieval accuracy. In this paper, we propose a video retrieval model that combines multimodal fusion with an attention mechanism. Our model employs the SlowFast backbone network, extracting skeleton motion features and static image features from video sequences using the Slow and Fast networks, respectively. To address feature fusion, we introduce a 3D residual attention structure between the two branches. By incorporating bilateral connections and hash encoding, we construct a hash layer that maps features into binary codes, improving retrieval efficiency. Experimental results on the UCF101 and HMDB51 datasets validate the effectiveness of our approach, demonstrating its advantages over state-of-the-art video retrieval methods.
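
The abstract says features are mapped into binary codes to make retrieval more efficient. A minimal sketch of why binary codes help is given below: once each video is a bit string, ranking reduces to XOR and bit counting. Everything here is an assumption for illustration, not the paper's method: the sign-threshold rule in `binarize`, the helper names `hamming_distance` and `retrieve`, and the 4-dimensional toy feature vectors are all hypothetical.

```python
from typing import List, Sequence, Tuple


def binarize(features: Sequence[float]) -> int:
    """Pack a real-valued feature vector into a binary code stored as an int.

    Assumption: sign thresholding at zero, a common choice in deep hashing;
    the hash layer described in the abstract may work differently.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query: int, database: List[int], top_k: int) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs for the top_k closest database codes."""
    ranked = sorted(
        ((i, hamming_distance(query, code)) for i, code in enumerate(database)),
        key=lambda pair: pair[1],
    )
    return ranked[:top_k]


# Hypothetical 4-dimensional hash-layer outputs for three videos.
video_features = [
    [0.9, -0.2, 0.4, -0.7],   # bits 1010
    [0.8, -0.1, -0.3, -0.5],  # bits 1000
    [-0.6, 0.7, -0.9, 0.2],   # bits 0101
]
database_codes = [binarize(f) for f in video_features]

# Hypothetical query feature vector, bits 1010.
query_code = binarize([0.5, -0.4, 0.3, -0.1])
results = retrieve(query_code, database_codes, top_k=2)
print(results)
```

In this toy setup the query shares its code with the first video, so that video ranks first at distance 0. Real systems use much longer codes (tens to hundreds of bits), but the comparison remains a bitwise operation rather than a floating-point distance.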

Pages: 470-474

Page count: 5
