Correspondence Matters for Video Referring Expression Comprehension

Cited: 8
Authors
Cao, Meng [1]
Jiang, Ji [1]
Chen, Long [2]
Zou, Yuexian [1,3]
Affiliations
[1] Peking University, SECE, Beijing, China
[2] Columbia University, New York, NY 10027, USA
[3] Peng Cheng Laboratory, Shenzhen, China
Source
Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022
Keywords
Video Referring Expression Comprehension; Inter-Frame Contrastive Learning; Cross-Modal Contrastive Learning; Tracking
DOI
10.1145/3503161.3547756
Chinese Library Classification
TP39 (Applications of Computers)
Discipline Codes
081203; 0835
Abstract
We investigate the problem of video Referring Expression Comprehension (REC), which aims to localize the referent object described by a sentence to visual regions in the video frames. Despite recent progress, existing methods suffer from two problems: 1) inconsistent localization results across video frames; and 2) confusion between the referent and contextual objects. To this end, we propose a novel Dual Correspondence Network (dubbed DCNet) that explicitly enhances dense associations in both the inter-frame and cross-modal manners. First, we build inter-frame correlations for all instances that appear in the frames: we compute inter-frame patch-wise cosine similarity to estimate the dense alignment, and then perform inter-frame contrastive learning to map corresponding patches close together in feature space. Second, we build a fine-grained patch-word alignment that associates each patch with certain words. Since such detailed annotations are unavailable, we also predict the patch-word correspondence through cosine similarity. Extensive experiments demonstrate that DCNet achieves state-of-the-art performance on both video and image REC benchmarks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs. Notably, our inter-frame and cross-modal contrastive losses are plug-and-play and applicable to any video REC architecture: for example, adding them on top of Co-grounding [44] yields a 1.48% absolute improvement in Accu.@0.5 on the VID-Sentence dataset. Our code is available at https://github.com/mengcaopku/DCNet.
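Both correspondence mechanisms described in the abstract reduce to InfoNCE-style contrastive losses driven by cosine similarity. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation: the function names (inter_frame_contrastive, patch_word_contrastive), tensor shapes, the temperature value, and the argmax-based pseudo-matching are illustrative assumptions; the official repository contains the actual losses.

```python
import torch.nn.functional as F

def inter_frame_contrastive(feat_a, feat_b, tau=0.07):
    # feat_a, feat_b: (P, D) patch features from two frames of the
    # same video (shapes and temperature tau are assumptions).
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.t() / tau              # (P, P) patch-wise cosine similarity
    # Dense alignment estimated from the similarity itself: each patch's
    # most similar patch in the other frame is its pseudo-positive.
    pos = sim.detach().argmax(dim=-1)  # (P,)
    # InfoNCE: pull matched patches together, push the rest apart.
    return F.cross_entropy(sim, pos)

def patch_word_contrastive(patch_feat, word_feat, tau=0.07):
    # patch_feat: (P, D) visual patches; word_feat: (W, D) word features.
    # No patch-word annotation exists, so the correspondence is again
    # predicted from cosine similarity, as the abstract describes.
    p = F.normalize(patch_feat, dim=-1)
    w = F.normalize(word_feat, dim=-1)
    sim = p @ w.t() / tau              # (P, W)
    pos = sim.detach().argmax(dim=-1)  # pseudo patch-to-word match
    return F.cross_entropy(sim, pos)
```

Consistent with the abstract's plug-and-play claim, both terms would be added as regularizers to an existing video REC model's task loss, e.g. loss = task_loss + lambda1 * inter_frame_contrastive(f_t, f_t1) + lambda2 * patch_word_contrastive(patches, words), where the weighting coefficients lambda1 and lambda2 are likewise assumptions.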
Pages: 4967-4976 (10 pages)
Related Papers (72 in total; [51]-[60] shown)
  • [51] Wang, Peng; Wu, Qi; Cao, Jiewei; Shen, Chunhua; Gao, Lianli; van den Hengel, Anton. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. In CVPR 2019, pp. 1960-1968.
  • [52] Wu, Zhirong; Xiong, Yuanjun; Yu, Stella X.; Lin, Dahua. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In CVPR 2018, pp. 3733-3742.
  • [53] Yang, S. In AAAI Conference on Artificial Intelligence, 2019, p. 5644.
  • [54] Yang, Xun; Liu, Xueliang; Jian, Meng; Gao, Xinjian; Wang, Meng. Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts. In ACM MM 2020, pp. 1939-1947.
  • [55] Yang, Zhengyuan; Gong, Boqing; Wang, Liwei; Huang, Wenbing; Yu, Dong; Luo, Jiebo. A Fast and Accurate One-Stage Approach to Visual Grounding. In ICCV 2019, pp. 4682-4692.
  • [56] Yang, Zhengyuan. In ECCV 2020.
  • [57] Yang, Zhengyuan. IEEE T. Circuits Syst., 2020.
  • [58] Yu, Haonan; Siskind, Jeffrey Mark. Sentence Directed Video Object Codiscovery. International Journal of Computer Vision, 2017, 124(3): 312-334.
  • [59] Yu, Licheng; Lin, Zhe; Shen, Xiaohui; Yang, Jimei; Lu, Xin; Bansal, Mohit; Berg, Tamara L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In CVPR 2018, pp. 1307-1315.
  • [60] Yu, Licheng; Tan, Hao; Bansal, Mohit; Berg, Tamara L. A Joint Speaker-Listener-Reinforcer Model for Referring Expressions. In CVPR 2017, pp. 3521-3529.