Transformer-based multi-level attention integration network for video saliency prediction

Authors
Rui Tan [1]
Minghui Sun [3]
Yanhua Liang [2]
Affiliations
[1] Jilin University, Software College
[2] Jilin University, College of Computer Science and Technology
[3] Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Keywords
Video saliency prediction; Transformer; Spatio-temporal feature; Self-attention
DOI
10.1007/s11042-024-19404-4
Abstract
Most existing video saliency prediction models rely heavily on 3D convolutions to extract spatio-temporal features. However, a 3D convolution has a local receptive field, so it may struggle to capture long-range spatio-temporal dependencies. To address this limitation, this paper introduces the Transformer-based Multi-level Attention Integration Network (TMAI-Net) for video saliency prediction. TMAI-Net is a two-stream encoder-decoder model that integrates multi-level semantic features. It incorporates a Multi-level Interactive Attention (MLIA) module and a Transformer, both built on the self-attention mechanism and placed at different levels of the model to capture long-range spatio-temporal feature dependencies. The model takes both video frames and attentional patches as input, which allows the Transformer module to capture structural similarities between related objects in the global features and the attention features. This in turn enables the model to allocate more attention to salient areas. The approach is validated through extensive experiments on three widely used benchmark datasets.
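The contrast the abstract draws between a local receptive field and long-range dependencies can be illustrated with a minimal sketch of scaled dot-product self-attention over flattened spatio-temporal tokens. This is not the TMAI-Net implementation: the toy feature map, the helper names `flatten_video` and `self_attention`, and the use of identity projections (queries, keys, and values all equal to the input tokens) are assumptions made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_video(features):
    """Flatten a nested T x H x W x C list into T*H*W tokens of length C."""
    return [pixel for frame in features for row in frame for pixel in row]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Assumption for illustration: Q = K = V = tokens (no learned weights).
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Every query is compared against every token, regardless of how
        # far apart the two are in space or time.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


# Hypothetical toy feature map: 2 frames, 2x2 pixels, 2 channels.
# The feature [1, 0] appears only at the first pixel of frame 0 and
# the last pixel of frame 1.
video = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]],
    [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],
]
tokens = flatten_video(video)
outputs, weights = self_attention(tokens)
```

In this toy input, the first token attends to the matching token in the other frame as strongly as it attends to itself, and more strongly than to its immediate spatial neighbor. A single small 3D convolution kernel would only mix features within its local window.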
Pages: 11833-11854
Number of pages: 21