VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search

Cited by: 49

Authors:

He, Shuting [1]

Luo, Hao [2]

Jiang, Wei [1]

Jiang, Xudong [3]

Ding, Henghui [3]

Affiliations:

[1] Zhejiang Univ, Coll Control Sci & Engn, State Key Lab Ind Control Technol, Hangzhou 310027, Peoples R China

[2] DAMO Acad, Alibaba Grp, Hangzhou 310027, Peoples R China

[3] Nanyang Technol Univ NTU, Sch Elect & Elect Engn EEE, Singapore 639798, Singapore

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2024 / Vol. 33

Funding:

National Natural Science Foundation of China;

Keywords:

Text-based person search; vision-guided; semantic-group; local cross-modal alignment; semantic-group textual learning; vision-guided knowledge transfer;

DOI:

10.1109/TIP.2023.3337653

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Text-based Person Search (TBPS) aims to retrieve images of a target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search to extract well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, to obtain the local textual representation, we group textual features along the channel dimension based on the semantic cues of the language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, a vision-guided attention is employed to extract visual-related textual features, which are inherently aligned with visual cues and termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, including a vision-language similarity transfer and a class probability transfer, to adaptively propagate information from the vision-guided textual features to the semantic-group textual features. With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with the corresponding visual features without external tools or complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate the superiority of VGSG over state-of-the-art methods.
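The two ideas named in the abstract, grouping textual features along the channel dimension and transferring vision-language similarity from vision-guided to semantic-group textual features, can be illustrated with a small sketch. This is not the authors' implementation: the equal-sized channel split, the dot-product similarity, the softmax temperature, the KL-divergence loss, and every function name below are assumptions chosen only for illustration.

```python
import math

def split_channels(feature, num_groups):
    """Split one feature vector into equal-sized channel groups.

    Assumption: a plain equal split; the abstract only says that grouping
    happens along the channel dimension.
    """
    if len(feature) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(feature) // num_groups
    return [feature[i * size:(i + 1) * size] for i in range(num_groups)]

def dot(a, b):
    """Dot product, used here as a stand-in similarity measure."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) between two discrete distributions."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student))

def similarity_transfer_loss(teacher_text, student_text, visual_gallery):
    """Hypothetical similarity-transfer loss.

    The teacher (vision-guided textual feature) and the student
    (semantic-group textual feature) are each compared against the same
    visual features; the student is penalised when its similarity
    distribution differs from the teacher's.
    """
    teacher = softmax([dot(teacher_text, v) for v in visual_gallery])
    student = softmax([dot(student_text, v) for v in visual_gallery])
    return kl_divergence(teacher, student)
```

For example, `split_channels([1, 2, 3, 4, 5, 6], 3)` yields three two-channel groups, and `similarity_transfer_loss` returns zero when the teacher and student features are identical and a positive value when their similarity distributions over the visual features differ.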

Citation

Pages: 163-176

Page count: 14

