MARS: Learning Modality-Agnostic Representation for Scalable Cross-Media Retrieval

Cited by: 25
Authors
Wang, Yunbo [1 ]
Peng, Yuxin [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Correlation; Training; Cats; Automobiles; Transforms; Media; Multi-modality learning; cross-media retrieval; modality scalability; similarity retrieval;
DOI
10.1109/TCSVT.2021.3136330
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Cross-media retrieval (CMR) offers a flexible retrieval experience across multiple modalities. Existing CMR approaches assume that all paired modalities are available during training and leverage the data of all modalities jointly to learn a common representation. Consequently, when data from a new modality arrives, the model must be re-trained on all previous modalities, compromising the flexibility and practicality of CMR. In this paper, we propose learning a Modality-Agnostic Representation for Scalable cross-media retrieval (MARS), which allows each modality to be trained independently. Specifically, MARS treats the label information as a distinct modality and introduces a label parsing module, LabNet, to generate a semantic representation for correlating different modalities. Meanwhile, MARS constructs a modality-specific representation module, DataNet, to obtain a modality-shared representation and a modality-exclusive representation, equipped with unbiased semantic classification. Technically, for the first modality, we jointly train the LabNet and its DataNet to preserve the semantic similarity between the label-derived representation and the modality-shared representation. For each new modality, MARS employs the well-learned LabNet to extract the label representation, which then serves as privileged information guiding the training of the associated DataNet via the same objective. Furthermore, we assign the same classifier to the representation modules of all modalities for better semantic alignment. Under this schema, the obtained modality-shared representation can be considered modality-agnostic. Extensive experiments on several benchmark multi-modality datasets demonstrate that the proposed MARS achieves better results than existing methods.
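The training schema in the abstract can be sketched structurally as follows. This is a rough illustration only, not the authors' implementation: the single linear layers standing in for LabNet and DataNet, the feature dimensions, and the squared-error surrogate for the similarity-preserving objective are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # A single linear map standing in for LabNet / DataNet
    # (assumption: the real modules are deeper networks).
    return rng.standard_normal((in_dim, out_dim)) * 0.1

n, n_classes, feat_dim, rep_dim = 8, 4, 16, 8

# Labels are treated as a distinct modality; LabNet parses them
# into a semantic representation shared across modalities.
labels = np.eye(n_classes)[rng.integers(0, n_classes, n)]  # one-hot
W_lab = linear(n_classes, rep_dim)
sem_rep = labels @ W_lab  # label-derived representation

# First modality: its DataNet is trained jointly with LabNet so that
# the modality-shared representation matches the label-derived one.
x_img = rng.standard_normal((n, feat_dim))
W_img = linear(feat_dim, rep_dim)
shared_img = x_img @ W_img
align_loss_img = np.mean((shared_img - sem_rep) ** 2)

# New modality: LabNet is now frozen, and its output acts as
# privileged information guiding the new DataNet independently,
# via the same objective — no re-training of earlier modalities.
x_txt = rng.standard_normal((n, feat_dim))
W_txt = linear(feat_dim, rep_dim)
shared_txt = x_txt @ W_txt
align_loss_txt = np.mean((shared_txt - sem_rep) ** 2)

# One classifier is shared by every modality's representation,
# encouraging unbiased semantic alignment.
W_cls = linear(rep_dim, n_classes)
logits_img = shared_img @ W_cls
logits_txt = shared_txt @ W_cls
print(logits_img.shape, logits_txt.shape)
```

The key point the sketch conveys is that `sem_rep` is computed once from labels and reused for every modality, so each new DataNet trains against it without touching previously trained modalities.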
Pages: 4765-4777 (13 pages)