A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Cited by: 3
Authors
Ding, Huiming [1 ]
Wang, Sen [1 ]
Xie, Zhifeng [1 ,2 ]
Li, Mengtian [1 ,2 ]
Ma, Lizhuang [2 ,3 ]
Affiliations
[1] Shanghai Univ, Dept Film & Televis Engn, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[2] Shanghai Engn Res Ctr Mot Picture Special Effects, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, 800 Dongchuan RD Minhang Dist, Shanghai 200240, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 115
Keywords
Vision and language representation; Graph neural network; Fashion semantic knowledge; Contrastive learning;
DOI
10.1016/j.cag.2023.07.025
Chinese Library Classification (CLC)
TP31 [Computer software];
Discipline Classification Codes
081202; 0835;
Abstract
Vision and language representation learning has been demonstrated to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which might neglect region-level visual features and fail to represent the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework to achieve a fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph structure from fashion descriptions and then aggregate it with word-level embeddings, which can strengthen the fashion semantic knowledge and obtain fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning to pull the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods. © 2023 Elsevier Ltd. All rights reserved.
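The local vision-and-language contrastive objective described above (pulling textual representations toward region-level visual features of the same garment) is commonly realized as a symmetric InfoNCE-style loss. The sketch below is an illustrative assumption, not the paper's exact formulation; the function name, feature shapes, and temperature value are all hypothetical.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(text_feats: torch.Tensor,
                           region_feats: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    text_feats, region_feats: (N, D) tensors where row i of each tensor
    describes the same garment, so the diagonal pairs are positives and
    all off-diagonal pairs are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(region_feats, dim=-1)
    # (N, N) similarity logits, scaled by the temperature.
    logits = t @ v.T / temperature
    targets = torch.arange(t.size(0), device=t.device)
    # Cross-entropy in both directions: text->region and region->text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Minimizing this loss raises the similarity of matched text/region pairs relative to mismatched ones, which is the "pulling closer" behavior the abstract refers to.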
Pages: 216 - 225
Page count: 10
Related Papers
55 records in total
  • [1] Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual Question Answering. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 2425-2433.
  • [2] Carbonell M, Riba P, Villegas M, Fornes A, Llados J. Named Entity Recognition and Relation Extraction with Graph Neural Networks in Semi Structured Documents. 2020 25th International Conference on Pattern Recognition (ICPR), 2021: 9622-9627.
  • [3] Chen M, Wang C, Liu L. Cross-domain retrieving sketch and shape using cycle CNNs. Computers & Graphics-UK, 2020, 89: 50-58.
  • [4] Chen Y, Gong S, Bazzani L. Image Search with Text Feedback by Visiolinguistic Attention Learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 2998-3008.
  • [5] Cho K. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014. DOI: 10.3115/v1/d14-1179.
  • [6] Wu C. Chinese Computational Linguistics: 19th China National Conference (CCL 2020), Proceedings. Lecture Notes in Artificial Intelligence (LNAI 12522), 2020: 129. DOI: 10.1007/978-3-030-63031-7_10.
  • [7] Dejean H. arXiv, 2022.
  • [8] Devlin J. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, 2019: 4171.
  • [9] Faghri F. BMVC, 2018: 12.
  • [10] Gan Z. Advances in Neural Information Processing Systems, 2020, 33: 6616.