Multi-Attention Network for Compressed Video Referring Object Segmentation

Cited by: 23
Authors
Chen, Weidong [1 ]
Hong, Dexiang [1 ]
Qi, Yuankai [2 ]
Han, Zhenjun [1 ]
Wang, Shuhui [3 ]
Qing, Laiyun [1 ]
Huang, Qingming [1 ,3 ]
Li, Guorong [1 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Univ Adelaide, Adelaide, SA, Australia
[3] Chinese Acad Sci, ICT, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Compressed Video Understanding; Vision and Language; Dual-path; Dual-attention; Multi-modal Transformer;
DOI
10.1145/3503161.3547761
CLC Number
TP39 [Computer Applications];
Discipline Codes
081203; 0835;
Abstract
Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper its application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, in this paper we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Beyond the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representations from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector, and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities, and then the fused multi-modal features are used to guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post-hoc mask-matching procedure of existing methods. Extensive experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods designed for RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
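The core idea of the content-aware dynamic kernel described in the abstract can be sketched numerically: an object query (conditioned on the fused multi-modal features) is mapped to the weights of a convolution kernel, which is then applied to the feature map to produce the segmentation mask. The sketch below is a minimal illustration of this dynamic-kernel mechanism only; all names, shapes, and the 1x1 kernel choice are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 8, 4, 4  # fused feature channels and spatial size (toy values)
D = 16             # object-query dimension (toy value)

fused_feat = rng.standard_normal((C, H, W))  # stand-in for fused multi-modal features
obj_query = rng.standard_normal(D)           # stand-in for the single object query

# A hypothetical linear head maps the query to the weights and bias of a
# 1x1 dynamic kernel, so the kernel depends on the content of the query.
W_head = rng.standard_normal((C + 1, D)) * 0.1
params = W_head @ obj_query
kernel, bias = params[:C], params[C]

# Applying a 1x1 kernel is one dot product per spatial location:
# contract the channel axis of the features against the kernel.
logits = np.tensordot(kernel, fused_feat, axes=([0], [0])) + bias
mask = 1.0 / (1.0 + np.exp(-logits)) > 0.5   # binary segmentation mask

print(mask.shape)  # (4, 4)
```

Because a single kernel is predicted, the output is one mask per query with no many-to-many matching between predicted and ground-truth masks, which is what allows the post-processing step to be dropped.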
Pages: 4416 - 4425
Number of Pages: 10
Related Papers
50 records
  • [21] Spectrum-guided Multi-granularity Referring Video Object Segmentation
    Miao, Bo
    Bennamoun, Mohammed
    Gao, Yongsheng
    Mian, Ajmal
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 920 - 930
  • [22] Multi-Attention Fusion Network for Video-based Emotion Recognition
    Wang, Yanan
    Wu, Jianming
    Hoashi, Keiichiro
    ICMI'19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2019, : 595 - 601
  • [23] Compressed Domain Video Object Segmentation
    Porikli, Fatih
    Bashir, Faisal
    Sun, Huifang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2010, 20 (01) : 2 - 14
  • [24] Multi-Object Tracking Via Multi-Attention
    Wang, Xianrui
    Ling, Hefei
    Chen, Jiazhong
    Li, Ping
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [25] RANet: Ranking Attention Network for Fast Video Object Segmentation
    Wang, Ziqin
    Xu, Jun
    Liu, Li
    Zhu, Fan
    Shao, Ling
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 3977 - 3986
  • [26] Real-time and multi-video-object segmentation for compressed video sequences
    Fu Wenxiu
    Wang Bin
    Liu Ming
    ICNC 2007: THIRD INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 3, PROCEEDINGS, 2007, : 747 - +
  • [27] Multi-Attention Network for Sentiment Analysis
    Du, Tingting
    Huang, Yunyin
    Wu, Xian
    Chang, Huiyou
    PROCEEDINGS OF THE 2018 2ND INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL (NLPIR 2018), 2018, : 49 - 54
  • [28] Object-Agnostic Transformers for Video Referring Segmentation
    Yang, Xu
    Wang, Hao
    Xie, De
    Deng, Cheng
    Tao, Dacheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 2839 - 2849
  • [29] A closer look at referring expressions for video object segmentation
    Miriam Bellver
    Carles Ventura
    Carina Silberer
    Ioannis Kazakos
    Jordi Torres
    Xavier Giro-i-Nieto
    Multimedia Tools and Applications, 2023, 82 : 4419 - 4438
  • [30] Video Object Segmentation Using Multi-Scale Attention-Based Siamese Network
    Zhu, Zhiliang
    Qiu, Leiningxin
    Wang, Jiaxin
    Xiong, Jinquan
    Peng, Hua
    ELECTRONICS, 2023, 12 (13)