Semantics Disentangling for Cross-Modal Retrieval

Times Cited: 11
Authors
Wang, Zheng [1 ,2 ,3 ]
Xu, Xing [4 ,5 ]
Wei, Jiwei [4 ,5 ]
Xie, Ning [4 ,5 ]
Yang, Yang [1 ,2 ,3 ]
Shen, Heng Tao [4 ,5 ,6 ]
Affiliations
[1] Univ Elect Sci & Technol China UESTC, Ctr Future Multimedia, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China UESTC, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] UESTC Guangdong, Inst Elect & Informat Engn, Dongguan 523808, Peoples R China
[4] Univ Elect Sci & Technol China UESTC, Ctr Future Multimedia, Chengdu 611731, Peoples R China
[5] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[6] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; semantics disentangling; dual adversarial mechanism; subspace learning; REPRESENTATION;
DOI
10.1109/TIP.2024.3374111
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval (e.g., querying with an image to obtain a semantically similar sentence, and vice versa) is an important but challenging task, as a heterogeneity gap and inconsistent distributions exist between modalities. The dominant approaches bridge this heterogeneity by capturing common representations of the heterogeneous data in a constructed subspace that reflects semantic closeness. However, they give insufficient consideration to the fact that the learned latent representations remain heavily entangled with semantic-unrelated features, which further compounds the challenges of cross-modal retrieval. To alleviate this difficulty, this work assumes that the data are jointly characterized by two independent features: semantic-shared and semantic-unrelated representations. The former captures the consistent semantics shared across modalities, while the latter reflects modality-specific characteristics unrelated to semantics, such as background, illumination, and other low-level information. This paper therefore aims to disentangle the shared semantics from the entangled features, so that the purer semantic representation can promote the closeness of paired data. Specifically, it designs a novel Semantics Disentangling approach for Cross-Modal Retrieval (termed SDCMR) to explicitly decouple the two different features based on a variational auto-encoder. Reconstruction is then performed by exchanging the shared semantics, to enforce the learning of semantic consistency. Moreover, a dual adversarial mechanism is designed to disentangle the two independent features via a pushing-and-pulling strategy. Comprehensive experiments on four widely used datasets demonstrate the effectiveness and superiority of the proposed SDCMR method, which sets a new performance bar when compared against 15 state-of-the-art methods.
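The cross-reconstruction step described in the abstract (rebuilding each modality from the *other* modality's shared-semantic code plus its own semantic-unrelated code) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the dimensions, the linear stand-ins for the VAE encoders/decoders, and the random inputs are all hypothetical, and the KL and adversarial terms of SDCMR are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, D_SH, D_UN = 16, 12, 4, 3  # toy dimensions (assumed)

# Toy linear "encoders": each modality is split into a semantic-shared
# code s and a semantic-unrelated code u (the paper's two factors).
W_img_s = rng.normal(size=(D_IMG, D_SH))
W_img_u = rng.normal(size=(D_IMG, D_UN))
W_txt_s = rng.normal(size=(D_TXT, D_SH))
W_txt_u = rng.normal(size=(D_TXT, D_UN))

# Toy linear "decoders": reconstruct each modality from (s, u).
G_img = rng.normal(size=(D_SH + D_UN, D_IMG))
G_txt = rng.normal(size=(D_SH + D_UN, D_TXT))

def encode(x, W_s, W_u):
    """Split an input into its shared and unrelated codes."""
    return x @ W_s, x @ W_u

def decode(s, u, G):
    """Reconstruct a modality from a (shared, unrelated) code pair."""
    return np.concatenate([s, u], axis=-1) @ G

# A paired image/text sample (random stand-ins for real features).
img = rng.normal(size=(1, D_IMG))
txt = rng.normal(size=(1, D_TXT))

s_i, u_i = encode(img, W_img_s, W_img_u)
s_t, u_t = encode(txt, W_txt_s, W_txt_u)

# Cross-reconstruction: swap the shared codes across modalities while
# keeping each modality's own unrelated code. If s_i and s_t truly carry
# the same semantics, the swap should not hurt reconstruction.
img_hat = decode(s_t, u_i, G_img)  # image rebuilt from the text's semantics
txt_hat = decode(s_i, u_t, G_txt)  # text rebuilt from the image's semantics

recon_loss = np.mean((img - img_hat) ** 2) + np.mean((txt - txt_hat) ** 2)
print(float(recon_loss))
```

In the full method this swapped-reconstruction loss is minimized jointly with the dual adversarial objective, which pushes the unrelated codes away from semantic content while pulling the shared codes of paired data together.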
Pages: 2226-2237
Number of Pages: 12