Research on Video Retrieval Technology based on Multimodal Fusion and Attention Mechanism

Citations: 0
Authors
Tai, Tianyang [1]
Zeng, Fanfeng [1]
Affiliations
[1] North China Univ Technol, Coll Informat, Beijing, Peoples R China
Source
PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON ELECTRONIC INFORMATION TECHNOLOGY AND COMPUTER ENGINEERING, EITCE 2023 | 2023
Keywords
Multimodal fusion; Video retrieval; Attention mechanism
DOI
10.1145/3650400.3650477
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Feature extraction and matching are crucial in video retrieval tasks. However, existing algorithms often overlook motion features in action-related videos and focus only on global static features. Distinguishing key action features from background features is difficult, which makes it hard to capture global dependencies during convolution. The result is less expressive features and lower retrieval accuracy. In this paper, we propose a video retrieval model that combines multimodal fusion with an attention mechanism. Our model uses the SlowFast backbone network, extracting skeleton motion features and static image features from video sequences with the Slow and Fast networks, respectively. For feature fusion, we introduce a 3D residual attention structure between the two branches. Using bilateral connections and hash encoding, we construct a hash layer that maps features into binary codes, improving retrieval efficiency. Experimental results on the UCF101 and HMDB51 datasets validate the effectiveness of our approach and show advantages over state-of-the-art video retrieval methods.
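The abstract does not give the details of the hash layer, so the following is only a minimal sketch of the general idea of mapping features to binary codes and ranking by Hamming distance. It assumes a sign-thresholded linear projection in place of the learned hash layer and random vectors in place of fused video features; the names `hash_codes`, `hamming_rank`, and `projection`, and the dimensions used, are hypothetical and not taken from the paper.

```python
import numpy as np

def hash_codes(features, projection):
    """Map real-valued features to binary codes via the sign of a linear projection."""
    # features: (n, d); projection: (d, bits). A trained hash layer would learn
    # `projection`; here it is simply supplied by the caller (assumption).
    return (features @ projection > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query, plus the distances."""
    # Hamming distance = number of bit positions where the codes differ.
    distances = np.count_nonzero(db_codes != query_code, axis=1)
    order = np.argsort(distances, kind="stable")
    return order, distances

rng = np.random.default_rng(0)
projection = rng.standard_normal((16, 32))   # hypothetical: 16-d features, 32-bit codes
database = rng.standard_normal((5, 16))      # stand-in for fused video features
db_codes = hash_codes(database, projection)

# Query with a feature vector that is already in the database (row 2).
query_code = hash_codes(database[2:3], projection)[0]
order, distances = hamming_rank(query_code, db_codes)
```

Comparing short binary codes by Hamming distance is cheaper than comparing floating-point feature vectors, which is the usual motivation for hashing in retrieval; how the paper trains its codes is not specified in this record.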
Pages: 470-474
Page count: 5