Visual Grounding With Dual Knowledge Distillation

Cited: 0
Authors
Wu, Wansen [1 ]
Cao, Meng [2 ]
Hu, Yue [1 ]
Peng, Yong [1 ]
Qin, Long [1 ]
Yin, Quanjun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410072, Peoples R China
[2] Tencent AI Lab, Shenzhen 518000, Peoples R China
Funding
Natural Science Foundation of Hunan Province; National Natural Science Foundation of China;
Keywords
Visualization; Task analysis; Semantics; Grounding; Feature extraction; Location awareness; Proposals; Visual grounding; vision and language; knowledge distillation;
DOI
10.1109/TCSVT.2024.3407785
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline codes
0808; 0809;
Abstract
Visual grounding is a task that seeks to predict the specific location of an object or region described by a linguistic expression within an image. Despite recent successes, existing methods still suffer from two problems. First, most methods use independently pre-trained unimodal feature encoders for extracting expressive feature embeddings, resulting in a significant semantic gap between unimodal embeddings and limiting the effective interaction of visual-linguistic contexts. Second, existing attention-based approaches equipped with a global receptive field tend to neglect the local information present in the images. This limitation restricts the semantic understanding required to distinguish between referred objects and the background, consequently leading to inadequate localization performance. Inspired by recent advances in knowledge distillation, in this paper we propose a DUal knowlEdge disTillation (DUET) method for visual grounding models to bridge the cross-modal semantic gap and improve localization performance simultaneously. Specifically, we utilize the CLIP model as the teacher model to transfer semantic knowledge to a student model, in which the vision and language modalities are linked into a unified embedding space. Besides, we design a self-distillation method for the student model to acquire localization knowledge by performing region-level contrastive learning to make the predicted region close to the positive samples. To this end, this work further proposes a Semantics-Location Aware sampling mechanism to generate high-quality self-distillation samples. Extensive experiments on five datasets and ablation studies demonstrate the state-of-the-art performance of DUET and its orthogonality with different student models, thereby making DUET adaptable to a wide range of visual grounding architectures. Code is available at DUET.
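The abstract describes two distillation terms: a semantic term that aligns the student's embeddings with the frozen CLIP teacher's unified embedding space, and a self-distillation term that applies region-level contrastive learning to pull the predicted region toward positive region samples. The sketch below is a minimal NumPy illustration of how such a combined objective could look; the function names, the cosine-alignment form of the teacher loss, the InfoNCE form of the region loss, and the weights `alpha`/`beta` are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def _normalize(x, eps=1e-8):
    """L2-normalize vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def semantic_distill_loss(student_emb, teacher_emb):
    """Cross-modal distillation term (assumed form): 1 minus the mean
    cosine similarity between student and frozen-teacher embeddings,
    so the loss is 0 when the student matches the teacher exactly."""
    s, t = _normalize(student_emb), _normalize(teacher_emb)
    return float(1.0 - np.mean(np.sum(s * t, axis=-1)))

def region_contrastive_loss(pred_region, positives, negatives, tau=0.07):
    """Region-level self-distillation term (assumed InfoNCE form):
    pull the predicted-region embedding toward positive region samples
    and away from negatives; averaged over the positives."""
    p = _normalize(pred_region)           # (d,)
    pos = _normalize(positives)           # (P, d)
    neg = _normalize(negatives)           # (N, d)
    pos_sim = np.exp(pos @ p / tau)       # similarity to each positive
    neg_sim = np.exp(neg @ p / tau)       # similarity to each negative
    denom = pos_sim.sum() + neg_sim.sum()
    return float(-np.mean(np.log(pos_sim / denom)))

def duet_style_loss(student_emb, teacher_emb, pred_region,
                    positives, negatives, alpha=1.0, beta=1.0):
    """Combined dual-distillation objective (illustrative weighting)."""
    return (alpha * semantic_distill_loss(student_emb, teacher_emb)
            + beta * region_contrastive_loss(pred_region, positives, negatives))
```

In this reading, the Semantics-Location Aware sampling mechanism would be responsible for producing the `positives` and `negatives` fed to the contrastive term; how those samples are actually selected is specified in the paper itself.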
Pages: 10399-10410
Page count: 12