CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network

Cited by: 187
Authors
Peng, Yuxin [1 ]
Qi, Jinwei [1 ]
Huang, Xin [1 ]
Yuan, Yuxin [1 ]
Affiliation
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; fine-grained correlation; joint optimization; multi-task learning; REPRESENTATION; MODEL;
DOI
10.1109/TMM.2017.2742704
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural networks (DNN): the first learning stage generates a separate representation for each modality, and the second learning stage obtains the cross-modal common representation. However, the existing methods have three limitations: 1) In the first learning stage, they model only intramodality correlation and ignore intermodality correlation with its rich complementary context. 2) In the second learning stage, they adopt only shallow networks with single-loss regularization and ignore the intrinsic relevance of intramodality and intermodality correlation. 3) Only original instances are considered, while the complementary fine-grained clues provided by their patches are ignored. To address these problems, this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network, with the following contributions: 1) In the first learning stage, CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In the second learning stage, a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that the CCL approach achieves the best performance.
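The multitask strategy described above, which balances an intramodality semantic category constraint against an intermodality pairwise similarity constraint, can be illustrated with a minimal sketch. This is not the paper's implementation: the hinge-style ranking term, the cross-entropy term, and the balancing weight `alpha` are generic stand-ins for the two constraint types, assuming L2-normalized embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def semantic_category_loss(logits, labels):
    """Intramodality constraint: cross-entropy against semantic class labels."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def pairwise_similarity_loss(img_emb, txt_emb, margin=0.2):
    """Intermodality constraint: a matched image/text pair should score higher
    than mismatched pairs by at least `margin` (triplet-style hinge)."""
    sim = img_emb @ txt_emb.T            # cosine similarities (rows L2-normalized)
    pos = np.diag(sim)                   # matched-pair similarities
    loss = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(loss, 0.0)          # exclude the matched pair itself
    return loss.mean()

def multitask_loss(logits_img, logits_txt, labels, img_emb, txt_emb, alpha=0.5):
    """Weighted combination of the two constraint types; `alpha` is a
    hypothetical fixed trade-off standing in for adaptive balancing."""
    intra = semantic_category_loss(logits_img, labels) \
          + semantic_category_loss(logits_txt, labels)
    inter = pairwise_similarity_loss(img_emb, txt_emb)
    return alpha * intra + (1.0 - alpha) * inter
```

In a real system both terms would be back-propagated jointly through the shared common-representation network, so the category supervision and the cross-modal ranking supervision regularize each other rather than being optimized in isolation.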
Pages: 405-420
Number of pages: 16