Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval

Cited by: 6
Authors
Cheng, Qingrong [1]
Gu, Xiaodong [1]
Affiliations
[1] Fudan University, Department of Electronic Engineering, Shanghai 200433, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Attention mechanism; Cross-modal retrieval; Bidirectional LSTM; Fine-grained similarity;
DOI
10.1007/s11042-020-09450-z
CLC number
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
People have witnessed the swift development of multimedia devices and multimedia technologies in recent years. Retrieving interesting and highly relevant information from massive multimedia data has become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to more fine-grained features for evaluating the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning and a modal discriminative network. The former adopts a bidirectional Long Short-Term Memory (Bi-LSTM) network and a pre-trained Inception-v3 model to extract text features and image features. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between the word-level features of a text description and the sub-regional features of an image. The modal discriminative network aims to minimize the "heterogeneity gap" between text features and image features in an adversarial manner. We conduct experiments on several widely used datasets to verify the performance of the proposed DAFSN. The experimental results show that DAFSN obtains better retrieval results under the mean average precision (MAP) metric. In addition, result analyses and visual comparisons are presented in the experimental section.
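To make the described architecture concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: word-to-region attention for fine-grained similarity, and a modal discriminator trained adversarially against the encoders. All module names, feature dimensions, and the exact attention and scoring formulas here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalFineGrainedSimilarity(nn.Module):
    # Hypothetical sketch: word-level text features (from a Bi-LSTM) attend
    # over image sub-region features (e.g., an Inception-v3 feature grid),
    # and the matching score aggregates word-to-attended-region similarities.
    def __init__(self, word_dim=300, region_dim=2048, embed_dim=512, hidden=512):
        super().__init__()
        # Bi-LSTM text branch, as named in the abstract.
        self.bilstm = nn.LSTM(word_dim, hidden // 2, batch_first=True,
                              bidirectional=True)
        # Projections into a shared embedding space (dimensions are assumptions).
        self.text_proj = nn.Linear(hidden, embed_dim)
        self.region_proj = nn.Linear(region_dim, embed_dim)

    def forward(self, words, regions):
        # words:   (B, T, word_dim)   word embeddings of a sentence
        # regions: (B, R, region_dim) sub-region features of an image
        h, _ = self.bilstm(words)                            # (B, T, hidden)
        t = F.normalize(self.text_proj(h), dim=-1)           # (B, T, embed_dim)
        v = F.normalize(self.region_proj(regions), dim=-1)   # (B, R, embed_dim)
        # Word-region cosine similarities, softmax-attended per word.
        attn = torch.softmax(t @ v.transpose(1, 2), dim=-1)  # (B, T, R)
        attended = attn @ v                                  # (B, T, embed_dim)
        # Fine-grained score: mean cosine similarity between each word and
        # its attended visual context (one scalar per image-text pair).
        return (t * attended).sum(-1).mean(-1)               # (B,)

class ModalDiscriminator(nn.Module):
    # Predicts whether a shared-space embedding came from text or image;
    # the encoders are trained to fool it, shrinking the heterogeneity gap.
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)  # logit: modality classification score

In training, the encoders would be updated to raise the scores of matched image-text pairs while fooling the discriminator, and the discriminator would be updated to classify each embedding's modality correctly; the specific losses, the sentence-level pair-matching term, and their weighting are left out here and are assumptions in any case.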
Pages: 31401-31428
Page count: 28