Learning Joint Multimodal Representation with Adversarial Attention Networks

Cited by: 17
Authors
Huang, Feiran [1 ]
Zhang, Xiaoming [2 ]
Li, Zhoujun [1 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
multimodal; representation learning; adversarial networks; attention model; siamese learning;
DOI
10.1145/3240508.3240614
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Recently, learning a joint representation for multimodal data (e.g., data containing both visual content and a text description) has attracted extensive research interest. Usually, the features of different modalities are correlated and complementary, and thus a joint representation capturing this correlation is more effective than any subset of the features. Most existing multimodal representation learning methods lack additional constraints to enhance the robustness of the learned representations. In this paper, a novel model, Adversarial Attention Networks (AAN), is proposed to incorporate both the attention mechanism and adversarial networks for effective and robust multimodal representation learning. Specifically, a visual-semantic attention model with a siamese learning strategy is proposed to encode the fine-grained correlation between the visual and textual modalities. Meanwhile, an adversarial learning model is employed to regularize the generated representation by matching the posterior distribution of the representation to the given priors. The two modules are then incorporated into an integrated learning framework to learn the joint multimodal representation. Experimental results on two tasks, i.e., multi-label classification and tag recommendation, show that the proposed model outperforms state-of-the-art representation learning methods.
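The abstract's visual-semantic attention component can be illustrated with a minimal sketch. This is not the authors' actual AAN architecture; it is a generic dot-product attention over image region features queried by a text vector, with all names and dimensions hypothetical, showing only how a text-conditioned weighting of visual features produces an attended visual representation for fusion:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_semantic_attention(regions, text_vec):
    """Attend over image region features using a text query vector.

    regions:  (n_regions, d) array of visual region features
    text_vec: (d,) textual feature vector
    Returns the attended visual vector and the attention weights.
    """
    scores = regions @ text_vec    # relevance of each region to the text
    weights = softmax(scores)      # normalize scores into a distribution
    attended = weights @ regions   # attention-weighted sum of region features
    return attended, weights

# Toy example with random features (dimensions are illustrative only).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
text_vec = rng.normal(size=8)
attended, weights = visual_semantic_attention(regions, text_vec)
```

In the paper's framework, such an attended representation would additionally be regularized adversarially, with a discriminator pushing the posterior distribution of the learned representation toward a chosen prior; that second module is omitted here for brevity.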
Pages: 1874-1882
Page count: 9
Related Papers
50 records in total
  • [31] Joint Image-text Representation Learning for Fashion Retrieval
    Yan, Cairong
    Li, Yu
    Wan, Yongquan
    Zhang, Zhaohui
ICMLC 2020: 2020 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2020, : 412 - 417
  • [32] Generating Adversarial Examples by Adversarial Networks for Semi-supervised Learning
    Ma, Yun
    Mao, Xudong
    Chen, Yangbin
    Li, Qing
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2019, 2019, 11881 : 115 - 129
  • [33] On Representation Learning for Road Networks
    Wang, Meng-Xiang
    Lee, Wang-Chien
    Fu, Tao-Yang
    Yu, Ge
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2021, 12 (01)
  • [34] CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
    Yu, Licheng
    Chen, Jun
    Sinha, Animesh
    Wang, Mengjiao
    Chen, Yu
    Berg, Tamara L.
    Zhang, Ning
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 4433 - 4442
  • [35] DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks
    Sankar, Aravind
    Wu, Yanhong
    Gou, Liang
    Zhang, Wei
    Yang, Hao
    PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM '20), 2020, : 519 - 527
  • [36] A student performance prediction model based on multimodal generative adversarial networks
    Liu, Junjie
    Yang, Yong
    INTERNATIONAL JOURNAL OF SENSOR NETWORKS, 2025, 47 (03) : 186 - 198
  • [37] Multimodal Fusion Representation Learning Based on Differential Privacy
    Cai, Chaoxin
    Sang, Yingpeng
    Huang, Jinghao
    Zhang, Maliang
    Li, Weizheng
    PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES, PDCAT 2021, 2022, 13148 : 548 - 559
  • [38] Multimodal Cardiac Segmentation Using Disentangled Representation Learning
    Chartsias, Agisilaos
    Papanastasiou, Giorgos
    Wang, Chengjia
    Stirrat, Colin
    Semple, Scott
    Newby, David
    Dharmakumar, Rohan
    Tsaftaris, Sotirios A.
    STATISTICAL ATLASES AND COMPUTATIONAL MODELS OF THE HEART: MULTI-SEQUENCE CMR SEGMENTATION, CRT-EPIGGY AND LV FULL QUANTIFICATION CHALLENGES, 2020, 12009 : 128 - 137
  • [39] Triple disentangled representation learning for multimodal affective analysis
    Zhou, Ying
    Liang, Xuefeng
    Chen, Han
    Zhao, Yin
    Chen, Xin
    Yu, Lida
    INFORMATION FUSION, 2024, 114
  • [40] Feature Equilibrium: An Adversarial Training Method to Improve Representation Learning
    Liu, Minghui
    Yang, Meiyi
    Deng, Jiali
    Cheng, Xuan
    Xie, Tianshu
    Deng, Pan
    Gong, Haigang
    Liu, Ming
    Wang, Xiaomin
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 2023, 16 (01)