CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network

Cited by: 187
Authors
Peng, Yuxin [1 ]
Qi, Jinwei [1 ]
Huang, Xin [1 ]
Yuan, Yuxin [1 ]
Affiliation
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; fine-grained correlation; joint optimization; multi-task learning; REPRESENTATION; MODEL;
DOI
10.1109/TMM.2017.2742704
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural networks (DNN): the first learning stage generates a separate representation for each modality, and the second learning stage obtains the cross-modal common representation. However, the existing methods have three limitations: 1) In the first learning stage, they model only intramodality correlation and ignore intermodality correlation with its rich complementary context. 2) In the second learning stage, they adopt only shallow networks with single-loss regularization and ignore the intrinsic relevance of intramodality and intermodality correlation. 3) Only original instances are considered, while the complementary fine-grained clues provided by their patches are ignored. To address these problems, this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network, with the following contributions: 1) In the first learning stage, CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In the second learning stage, a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that the CCL approach achieves the best performance.
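The multitask strategy described above, which balances an intramodality semantic category constraint against an intermodality pairwise similarity constraint, can be illustrated with a minimal sketch. This is not the paper's implementation: the hinge-style ranking term, the cross-entropy term, and the balancing weight `alpha` are generic stand-ins for the two constraint types, assuming L2-normalized embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def semantic_category_loss(logits, labels):
    """Intramodality constraint: cross-entropy against semantic class labels."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def pairwise_similarity_loss(img_emb, txt_emb, margin=0.2):
    """Intermodality constraint: a matched image/text pair should score higher
    than mismatched pairs by at least `margin` (triplet-style hinge)."""
    sim = img_emb @ txt_emb.T            # cosine similarities (rows L2-normalized)
    pos = np.diag(sim)                   # matched-pair similarities
    loss = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(loss, 0.0)          # exclude the matched pair itself
    return loss.mean()

def multitask_loss(logits_img, logits_txt, labels, img_emb, txt_emb, alpha=0.5):
    """Weighted combination of the two constraint types; `alpha` is a
    hypothetical fixed trade-off standing in for adaptive balancing."""
    intra = semantic_category_loss(logits_img, labels) \
          + semantic_category_loss(logits_txt, labels)
    inter = pairwise_similarity_loss(img_emb, txt_emb)
    return alpha * intra + (1.0 - alpha) * inter
```

In a real system both terms would be back-propagated jointly through the shared common-representation network, so the category supervision and the cross-modal ranking supervision regularize each other rather than being optimized in isolation.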
Pages: 405-420
Number of pages: 16