Multimodal Alignment and Attention-Based Person Search via Natural Language Description

Cited by: 10
Authors
Ji, Zhong [1]
Li, Shengjia [1]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; Natural languages; Visualization; Internet of Things; Cameras; Sensors; Surveillance; Attention mechanism (AM); natural language description; person search; Visual Internet of Things (VIoT); NETWORK;
DOI
10.1109/JIOT.2020.2995148
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Visual Internet of Things (VIoT) has been widely deployed in the field of social security. However, how to enable it to be intelligent is an urgent yet challenging task. In this article, we address the task of searching persons with natural language description query in a public safety surveillance system, which is a practical and demanding technique in VIoT. It is a fine-grained many-to-many cross-modal problem and more challenging than those with the image and the attribute as queries. The existing attempts are still weak in bridging the semantic gap between visual modality from different camera sensors and text modality from natural language descriptions. We propose a deep person search approach with a natural language description query by employing the attention mechanism (AM) and multimodal alignment (MA) method to supervise the cross-modal mapping. Particularly, the AM consists of two self-attention modules and one cross-attention module, where the former aims at learning discriminative representations and the latter supervises each other with their own information to offer accurate guidance to a common space. The MA approach contains three alignment processes with a novel cross-ranking loss function to make different matching pairs separable in a common space. Extensive experiments on large-scale CUHK-PEDES demonstrate the superiority of the proposed approach.
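The abstract does not give the form of the cross-ranking loss, so the following is only a minimal sketch of a generic bidirectional hinge ranking loss of the kind commonly used to separate matched and mismatched image-text pairs in a common space. It is not the authors' loss function. The function names, the `margin` value, and the simplifying assumption that `image_embs[i]` matches only `text_embs[i]` (the actual task is many-to-many) are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_ranking_loss(image_embs, text_embs, margin=0.2):
    """Generic hinge ranking loss (hypothetical helper, not the paper's loss).

    Assumes image_embs[i] and text_embs[i] form the only matching pair for
    index i; every other combination is treated as a mismatch.
    """
    n = len(image_embs)
    sim = [[cosine(img, txt) for txt in text_embs] for img in image_embs]
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i as query: mismatched text j should score below the match.
            loss += max(0.0, margin - pos + sim[i][j])
            # Text i as query: mismatched image j should score below the match.
            loss += max(0.0, margin - pos + sim[j][i])
    return loss / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_ranking_loss(images, aligned_texts))
    print(bidirectional_ranking_loss(images, swapped_texts))
```

In this toy setup the loss is zero when each image embedding coincides with its description's embedding and grows when the pairs are swapped, which is the separability property the abstract attributes to the alignment step.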
Pages: 11147-11156
Page count: 10