Multimodal Alignment and Attention-Based Person Search via Natural Language Description

Cited by: 10

Authors:

Ji, Zhong [1]

Li, Shengjia [1]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China

Source:

IEEE INTERNET OF THINGS JOURNAL | 2020 / Vol. 7 / No. 11

Funding:

National Natural Science Foundation of China;

Keywords:

Task analysis; Natural languages; Visualization; Internet of Things; Cameras; Sensors; Surveillance; Attention mechanism (AM); natural language description; person search; Visual Internet of Things (VIoT); NETWORK;

DOI:

10.1109/JIOT.2020.2995148

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Visual Internet of Things (VIoT) has been widely deployed in the field of social security. However, how to enable it to be intelligent is an urgent yet challenging task. In this article, we address the task of searching persons with natural language description query in a public safety surveillance system, which is a practical and demanding technique in VIoT. It is a fine-grained many-to-many cross-modal problem and more challenging than those with the image and the attribute as queries. The existing attempts are still weak in bridging the semantic gap between visual modality from different camera sensors and text modality from natural language descriptions. We propose a deep person search approach with a natural language description query by employing the attention mechanism (AM) and multimodal alignment (MA) method to supervise the cross-modal mapping. Particularly, the AM consists of two self-attention modules and one cross-attention module, where the former aims at learning discriminative representations and the latter supervises each other with their own information to offer accurate guidance to a common space. The MA approach contains three alignment processes with a novel cross-ranking loss function to make different matching pairs separable in a common space. Extensive experiments on large-scale CUHK-PEDES demonstrate the superiority of the proposed approach.


Pages: 11147-11156

Page count: 10

