Enhanced Multimodal Representation Learning with Cross-modal KD

Cited by: 3
Authors
Chen, Mengxi [1]
Xing, Linyu [1]
Wang, Yu [1,2]
Zhang, Ya [1,2]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023
Funding
National Key R&D Program of China
Keywords
NETWORKS
DOI
10.1109/CVPR52729.2023.01132
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper explores the task of leveraging auxiliary modalities that are available only during training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted mutual-information-maximization objective admits a shortcut solution, the weak teacher: mutual information is trivially maximized by making the teacher model as weak as the student model. To prevent this degenerate solution, we introduce an additional objective term, the mutual information between the teacher and the auxiliary-modality model. Furthermore, to narrow the information gap between the student and the teacher, we propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets show that the proposed method outperforms a range of state-of-the-art approaches on video recognition, video retrieval, and emotion classification.
Pages: 11766 - 11775
Page count: 10
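For intuition, the abstract's two contrastive terms amount to maximizing $I(T;S) + \lambda\, I(T;A)$, where $T$, $S$, and $A$ denote the teacher, student, and auxiliary-modality representations, alongside an adversarial term that minimizes $H(T \mid S)$. Below is a minimal PyTorch sketch of how such mutual-information terms might be estimated with an InfoNCE-style lower bound. The function names, the weighting `lam`, and in-batch negative sampling are illustrative assumptions, not the paper's actual implementation, and the adversarial conditional-entropy term is omitted.

```python
import torch
import torch.nn.functional as F


def info_nce_lower_bound(z_a: torch.Tensor, z_b: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on I(A; B) for paired embeddings.

    Matching rows of z_a and z_b are positive pairs; every other row in
    the batch acts as a negative. Returns a scalar to be maximized.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) cosine similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)   # positives on the diagonal
    return -F.cross_entropy(logits, labels)                 # higher => tighter MI bound


def contrastive_kd_loss(teacher_emb: torch.Tensor, student_emb: torch.Tensor,
                        aux_emb: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Contrastive part of the objective (hypothetical weighting lam):
    maximize I(teacher; student) + lam * I(teacher; auxiliary)."""
    mi_ts = info_nce_lower_bound(teacher_emb, student_emb)
    mi_ta = info_nce_lower_bound(teacher_emb, aux_emb)
    return -(mi_ts + lam * mi_ta)                           # negate: loss is minimized


if __name__ == "__main__":
    B, D = 8, 128
    t, s, a = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(contrastive_kd_loss(t, s, a).item())
```

Note that the second term penalizes a teacher that discards auxiliary-modality information, which is how the abstract's "weak teacher" shortcut is ruled out in this sketch.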