A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Cited by: 3
Authors
Ding, Huiming [1 ]
Wang, Sen [1 ]
Xie, Zhifeng [1 ,2 ]
Li, Mengtian [1 ,2 ]
Ma, Lizhuang [2 ,3 ]
Affiliations
[1] Shanghai Univ, Dept Film & Televis Engn, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[2] Shanghai Engn Res Ctr Mot Picture Special Effects, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, 800 Dongchuan RD Minhang Dist, Shanghai 200240, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 115
Keywords
Vision and language representation; Graph neural network; Fashion semantic knowledge; Contrastive learning;
DOI
10.1016/j.cag.2023.07.025
CLC Classification Number
TP31 [Computer Software];
Subject Classification Number
081202 ; 0835 ;
Abstract
Vision and language representation learning has been demonstrated to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which may fail to capture region-level visual features and to represent the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework to achieve a fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph structure from fashion descriptions and then aggregate it with word-level embeddings, which strengthens the fashion semantic knowledge and yields fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning that pulls the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods. © 2023 Elsevier Ltd. All rights reserved.
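The local vision and language contrastive objective described above can be sketched as a symmetric InfoNCE-style loss over matched (region, text) pairs. This is a minimal illustration, not the authors' implementation: the feature extractors, temperature value, and pairing scheme here are assumptions, and the batch diagonal is taken as the positive pair for each garment.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature rows onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched (region, text) pairs.

    region_feats, text_feats: (N, D) arrays where row i of each comes
    from the same garment. Diagonal entries of the similarity matrix
    are positives; every other row in the batch acts as a negative,
    so minimizing the loss pulls matched pairs together and pushes
    mismatched pairs apart. The 0.07 temperature is a common default,
    not a value taken from the paper.
    """
    v = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature          # (N, N) cosine similarities
    labels = np.arange(logits.shape[0])

    def xent(lg):
        # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of vision->text and text->vision directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned features the loss is near zero; shuffling the text rows so pairs no longer match drives it up, which is the behavior the training signal relies on.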
Pages: 216-225 (10 pages)