Enhanced Multimodal Representation Learning with Cross-modal KD

Cited by: 3
Authors
Chen, Mengxi [1]
Xing, Linyu [1]
Wang, Yu [1,2]
Zhang, Ya [1,2]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023
Funding
National Key R&D Program of China
Keywords
NETWORKS
DOI
10.1109/CVPR52729.2023.01132
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper explores the task of leveraging auxiliary modalities that are available only during training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted mutual-information-maximization objective admits a shortcut solution, the weak teacher: mutual information is trivially maximized by making the teacher model as weak as the student model. To prevent this degenerate solution, we introduce an additional objective term, the mutual information between the teacher and the auxiliary-modality model. Furthermore, to narrow the information gap between the student and the teacher, we propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets show that the proposed method outperforms a range of state-of-the-art approaches on video recognition, video retrieval, and emotion classification.
Pages: 11766 - 11775
Page count: 10
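For intuition, the abstract's two contrastive terms amount to maximizing $I(T;S) + \lambda\, I(T;A)$, where $T$, $S$, and $A$ denote the teacher, student, and auxiliary-modality representations, alongside an adversarial term that minimizes $H(T \mid S)$. Below is a minimal PyTorch sketch of how such mutual-information terms might be estimated with an InfoNCE-style lower bound. The function names, the weighting `lam`, and in-batch negative sampling are illustrative assumptions, not the paper's actual implementation, and the adversarial conditional-entropy term is omitted.

```python
import torch
import torch.nn.functional as F


def info_nce_lower_bound(z_a: torch.Tensor, z_b: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on I(A; B) for paired embeddings.

    Matching rows of z_a and z_b are positive pairs; every other row in
    the batch acts as a negative. Returns a scalar to be maximized.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) cosine similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)   # positives on the diagonal
    return -F.cross_entropy(logits, labels)                 # higher => tighter MI bound


def contrastive_kd_loss(teacher_emb: torch.Tensor, student_emb: torch.Tensor,
                        aux_emb: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Contrastive part of the objective (hypothetical weighting lam):
    maximize I(teacher; student) + lam * I(teacher; auxiliary)."""
    mi_ts = info_nce_lower_bound(teacher_emb, student_emb)
    mi_ta = info_nce_lower_bound(teacher_emb, aux_emb)
    return -(mi_ts + lam * mi_ta)                           # negate: loss is minimized


if __name__ == "__main__":
    B, D = 8, 128
    t, s, a = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(contrastive_kd_loss(t, s, a).item())
```

Note that the second term penalizes a teacher that discards auxiliary-modality information, which is how the abstract's "weak teacher" shortcut is ruled out in this sketch.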