Adversarial Graph Attention Network for Multi-modal Cross-modal Retrieval

Times Cited: 2
Authors
Wu, Hongchang [1 ]
Guan, Ziyu [2 ]
Zhi, Tao [3 ]
Zhao, Wei [1 ]
Xu, Cai [2 ]
Han, Hong [2 ]
Yang, Yanning [2 ]
Affiliations
[1] Xidian Univ, Sch Comp Sci & Technol, Xian, Peoples R China
[2] Xidian Univ, Xian, Peoples R China
[3] Xidian Univ, Sch Artificial Intelligence, Xian, Peoples R China
Source
2019 10TH IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (ICBK 2019) | 2019
Keywords
Cross-modal retrieval; graph attention; self attention; generative adversarial network;
DOI
10.1109/ICBK.2019.00043
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Existing cross-modal retrieval methods are mainly constrained to the bimodal case. When applied to the multi-modal case, we need to train O(K^2) (K: number of modalities) separate models, which is inefficient and unable to exploit common information among multiple modalities. Though some studies focused on learning a common space of multiple modalities for retrieval, they assumed data to be i.i.d. and failed to learn the underlying semantic structure which could be important for retrieval. To tackle these issues, we propose an extensive Adversarial Graph Attention Network for Multi-modal Cross-modal Retrieval (AGAT). AGAT synthesizes a self-attention network (SAT), a graph attention network (GAT) and a multi-modal generative adversarial network (MGAN). The SAT generates high-level embeddings for data items from different modalities, with self-attention capturing feature-level correlations in each modality. The GAT then uses attention to aggregate embeddings of matched items from different modalities to build a common embedding space. The MGAN aims to "cluster" matched embeddings of different modalities in the common space by forcing them to be similar to the aggregation. Finally, we train the common space so that it captures the semantic structure by constraining within-class/between-class distances. Experiments on three datasets show the effectiveness of AGAT.
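The record contains only metadata, but as a rough illustration of the pipeline the abstract describes (per-modality self-attention encoding, attention-based aggregation of matched items into a common space, and an adversarial loss pulling modality embeddings toward that aggregation), a minimal PyTorch sketch follows. All module names, dimensions, the simple modality-level attention, and the loss wiring are assumptions made for illustration; they are not the authors' SAT/GAT/MGAN implementation, and the within-class/between-class distance constraints are not reproduced here.

```python
# Hypothetical minimal sketch of the AGAT-style pipeline described in the abstract.
# Names, dimensions, and design details below are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """SAT-style encoder: self-attention over the feature tokens of one modality."""
    def __init__(self, in_dim, embed_dim, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                      # x: (batch, tokens, in_dim)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)              # feature-level self-attention
        return self.out(h.mean(dim=1))         # (batch, embed_dim) item embedding

def attention_aggregate(embeddings):
    """GAT-style step (simplified): attention-weighted average of matched
    embeddings from K modalities, giving a common-space aggregation."""
    stacked = torch.stack(embeddings, dim=1)                          # (batch, K, d)
    scores = (stacked * stacked.mean(dim=1, keepdim=True)).sum(-1)    # (batch, K)
    weights = F.softmax(scores, dim=1).unsqueeze(-1)                  # attention over modalities
    return (weights * stacked).sum(dim=1)                             # (batch, d)

class Discriminator(nn.Module):
    """MGAN-style discriminator: scores whether an embedding looks like the aggregation."""
    def __init__(self, embed_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, 1))

    def forward(self, z):
        return self.net(z)

# Toy forward pass with two modalities (e.g., image regions and text words).
img_enc, txt_enc = ModalityEncoder(512, 128), ModalityEncoder(300, 128)
disc = Discriminator(128)
img = torch.randn(8, 10, 512)                  # 8 items, 10 region features each
txt = torch.randn(8, 20, 300)                  # 8 items, 20 word features each
z_img, z_txt = img_enc(img), txt_enc(txt)
z_common = attention_aggregate([z_img, z_txt])
# Adversarial "clustering": the encoders are trained so the discriminator scores
# each modality embedding as if it were the aggregated common embedding.
adv_loss = (F.binary_cross_entropy_with_logits(disc(z_img), torch.ones(8, 1)) +
            F.binary_cross_entropy_with_logits(disc(z_txt), torch.ones(8, 1)))
```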
Pages: 265-272
Number of pages: 8