Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval

Cited by: 121
Authors
Xu, Xing [1 ,2 ]
Lu, Huimin [3 ,4 ]
Song, Jingkuan [1 ,2 ]
Yang, Yang [1 ,2 ]
Shen, Heng Tao [1 ,2 ]
Li, Xuelong [5 ,6 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Multimedia, Chengdu 610051, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610051, Peoples R China
[3] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[4] Kyushu Inst Technol, Dept Mech & Control Engn, Kitakyushu, Fukuoka 8048550, Japan
[5] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[6] Northwestern Polytech Univ, Ctr Opt Imagery Anal & Learning, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Correlation; Knowledge transfer; Standards; Task analysis; Training; Feature extraction; Adversarial learning; cross-modal retrieval; self-supervision; zero-shot learning (ZSL); FUSION;
DOI
10.1109/TCYB.2019.2928180
Chinese Library Classification
TP [Automation & Computer Technology];
Discipline code
0812 ;
Abstract
Given a query instance from one modality (e.g., image), cross-modal retrieval aims to find semantically similar instances from another modality (e.g., text). To perform cross-modal retrieval, existing approaches typically learn a common semantic space from a labeled source set and directly produce common representations in the learned space for the instances in a target set. These methods commonly require that the instances of both sets share the same classes. Intuitively, they may not generalize well to a more practical scenario of zero-shot cross-modal retrieval, in which the instances of the target set contain unseen classes whose semantics are inconsistent with the seen classes in the source set. Inspired by zero-shot learning, in this paper we propose a novel model called ternary adversarial networks with self-supervision (TANSS) to overcome the limitation of existing methods on this challenging task. Our TANSS approach consists of three parallel subnetworks: 1) two semantic feature learning subnetworks that capture the intrinsic data structures of different modalities and preserve the modality relationships via semantic features in the common semantic space; 2) a self-supervised semantic subnetwork that leverages the word vectors of both seen and unseen labels as guidance to supervise the semantic feature learning and enhance knowledge transfer to unseen labels; and 3) an adversarial learning scheme that maximizes the consistency and correlation of the semantic features between different modalities. The three subnetworks are integrated in TANSS to form an end-to-end network architecture that enables efficient iterative parameter optimization. Comprehensive experiments on three cross-modal datasets show the effectiveness of our TANSS approach compared with state-of-the-art methods for zero-shot cross-modal retrieval.
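The retrieval setting described above — embed instances from two modalities into one common semantic space, then rank by similarity — can be illustrated with a toy sketch. This is not the TANSS model: the fixed random projections `W_img` and `W_txt` are hypothetical stand-ins for the paper's learned modality encoders, and the dimensionalities are arbitrary; only the embed-then-rank retrieval step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensionalities (assumptions, not from the paper).
d_img, d_txt, d_common = 8, 6, 4

# Hypothetical stand-ins for trained modality encoders:
# fixed random projections into the common semantic space.
W_img = rng.standard_normal((d_img, d_common))
W_txt = rng.standard_normal((d_txt, d_common))

def encode(x, W):
    """Project into the common space and L2-normalize,
    so dot products become cosine similarities."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# A gallery of 5 text instances and one image query (random toy data).
texts = rng.standard_normal((5, d_txt))
query_img = rng.standard_normal(d_img)

gallery = encode(texts, W_txt)          # shape (5, d_common)
q = encode(query_img[None, :], W_img)   # shape (1, d_common)

# Cross-modal retrieval: rank text instances by cosine similarity
# to the image query; with trained encoders, higher = more
# semantically similar.
scores = (gallery @ q.T).ravel()
ranking = np.argsort(-scores)
print(ranking)
```

With real encoders trained as in the paper, the same ranking step would return semantically matching instances first; unit-normalizing both sides is a common choice because it makes the score scale comparable across modalities.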
Pages: 2400-2413
Page count: 14