A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Cited by: 3
Authors
Ding, Huiming [1 ]
Wang, Sen [1 ]
Xie, Zhifeng [1 ,2 ]
Li, Mengtian [1 ,2 ]
Ma, Lizhuang [2 ,3 ]
Affiliations
[1] Shanghai Univ, Dept Film & Televis Engn, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[2] Shanghai Engn Res Ctr Mot Picture Special Effects, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, 800 Dongchuan RD Minhang Dist, Shanghai 200240, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 115
Keywords
Vision and language representation; Graph neural network; Fashion semantic knowledge; Contrastive learning;
DOI
10.1016/j.cag.2023.07.025
Chinese Library Classification (CLC)
TP31 [Computer software];
Discipline Classification Codes
081202; 0835;
Abstract
Vision and language representation learning has been demonstrated to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which might neglect region-level visual features and fail to represent the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework to achieve a fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph structure from fashion descriptions and then aggregate it with word-level embeddings, which can strengthen the fashion semantic knowledge and obtain fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning to pull the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods. © 2023 Elsevier Ltd. All rights reserved.
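The local vision-and-language contrastive objective described above (pulling textual representations toward region-level visual features of the same garment) is commonly realized as a symmetric InfoNCE-style loss. The sketch below is an illustrative assumption, not the paper's exact formulation; the function name, feature shapes, and temperature value are all hypothetical.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(text_feats: torch.Tensor,
                           region_feats: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    text_feats, region_feats: (N, D) tensors where row i of each tensor
    describes the same garment, so the diagonal pairs are positives and
    all off-diagonal pairs are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(region_feats, dim=-1)
    # (N, N) similarity logits, scaled by the temperature.
    logits = t @ v.T / temperature
    targets = torch.arange(t.size(0), device=t.device)
    # Cross-entropy in both directions: text->region and region->text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Minimizing this loss raises the similarity of matched text/region pairs relative to mismatched ones, which is the "pulling closer" behavior the abstract refers to.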
Pages: 216 - 225
Page count: 10
Related Papers
55 records in total
  • [1] Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual Question Answering. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 2425-2433.
  • [2] Carbonell M, Riba P, Villegas M, Fornes A, Llados J. Named Entity Recognition and Relation Extraction with Graph Neural Networks in Semi Structured Documents. 2020 25th International Conference on Pattern Recognition (ICPR), 2021: 9622-9627.
  • [3] Chen M, Wang C, Liu L. Cross-domain retrieving sketch and shape using cycle CNNs. Computers & Graphics-UK, 2020, 89: 50-58.
  • [4] Chen Y, Gong S, Bazzani L. Image Search with Text Feedback by Visiolinguistic Attention Learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 2998-3008.
  • [5] Cho K. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014. DOI: 10.3115/v1/d14-1179.
  • [6] Wu C. Chinese Computational Linguistics: 19th China National Conference (CCL 2020), Proceedings. Lecture Notes in Artificial Intelligence (LNAI 12522), 2020: 129. DOI: 10.1007/978-3-030-63031-7_10.
  • [7] Dejean H. arXiv, 2022.
  • [8] Devlin J. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, 2019: 4171.
  • [9] Faghri F. BMVC, 2018: 12.
  • [10] Gan Z. Advances in Neural Information Processing Systems, 2020, 33: 6616.