Text-to-Image Person Re-Identification Based on Multimodal Graph Convolutional Network

Cited by: 10
Authors
Han, Guang [1 ]
Lin, Min [1 ]
Li, Ziyang [1 ]
Zhao, Haitao [1 ]
Kwong, Sam [2 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Sch Commun & Informat Engn, Nanjing 210003, Peoples R China
[2] Lingnan Univ, Dept Comp & Decis Sci, Hong Kong, Peoples R China
Keywords
Cross-modal retrieval; person re-identification; person search; image-text retrieval; graph convolutional network
DOI
10.1109/TMM.2023.3344354
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Code
0812
Abstract
Text-to-image person re-identification (ReID) is a common subproblem in the fields of person re-identification and image-text retrieval. Recent approaches generally follow a dual-stream structure, extracting image and text features separately. This design permits no deep interaction between images and text, making it difficult for the network to learn a highly semantic feature representation. In addition, for both image and text data, feature extraction is typically modeled in a regular way, such as using a Transformer to extract sequence embeddings; this type of modeling disregards the inherent relationships among multimodal input embeddings. A more flexible approach to mining multimodal data is proposed, which uniformly treats the data as graphs. In this way, the extraction and interaction of multimodal information are accomplished through message passing between graph nodes. First, a unified multimodal feature extraction and fusion network based on the graph convolutional network is proposed, which enables the progression of multimodal information from 'local' to 'global'. Second, an asymmetric multilevel alignment module, which focuses on more accurate 'local' information from a 'global' perspective, is proposed to progressively divide the multimodal information at each level. Finally, a cross-modal representation matching strategy based on similarity distribution and mutual information is proposed to achieve cross-modal alignment. The proposed algorithm is simple and efficient, and test results on three public datasets (CUHK-PEDES, ICFG-PEDES and RSTPReid) show that it achieves state-of-the-art performance.
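The core idea in the abstract, treating image regions and text tokens as nodes of one graph so that cross-modal interaction happens via message passing, can be sketched as follows. This is a minimal illustration under assumed shapes and a hand-built toy graph, not the authors' implementation: a single GCN-style update H' = D^{-1/2} A D^{-1/2} H W over a joint graph whose cross-modal edges let the two modalities exchange information.

```python
# Minimal sketch of cross-modal message passing on a joint graph
# (hypothetical names and shapes; not the paper's actual network).

def gcn_layer(adj, feats, weight):
    """One graph-convolution step with symmetric degree normalization."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Normalized adjacency: a_hat[i][j] = A[i][j] / sqrt(d_i * d_j)
    a_hat = [[adj[i][j] / ((deg[i] * deg[j]) ** 0.5) if adj[i][j] else 0.0
              for j in range(n)] for i in range(n)]
    # Message passing: aggregate neighbor features ...
    agg = [[sum(a_hat[i][k] * feats[k][d] for k in range(n))
            for d in range(len(feats[0]))] for i in range(n)]
    # ... then apply the shared weight matrix.
    out = [[sum(agg[i][d] * weight[d][e] for d in range(len(weight)))
            for e in range(len(weight[0]))] for i in range(n)]
    return out

# Toy joint graph: nodes 0-1 are image-region embeddings, nodes 2-3 are
# text-token embeddings; self-loops plus cross-modal edges allow the two
# modalities to interact during aggregation.
adj = [[1, 1, 1, 0],
       [1, 1, 0, 1],
       [1, 0, 1, 1],
       [0, 1, 1, 1]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weight, for clarity
fused = gcn_layer(adj, feats, weight)
```

After one layer, each image node's feature already mixes in text-node features (and vice versa), which is the 'local'-to-'global' information flow the abstract describes; stacking layers widens each node's receptive field over the joint graph.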
Pages: 6025-6036
Page count: 12