A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Cited by: 3
Authors
Ding, Huiming [1 ]
Wang, Sen [1 ]
Xie, Zhifeng [1 ,2 ]
Li, Mengtian [1 ,2 ]
Ma, Lizhuang [2 ,3 ]
Affiliations
[1] Shanghai Univ, Dept Film & Televis Engn, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[2] Shanghai Engn Res Ctr Mot Picture Special Effects, 149 Yanchang RD Jingan Dist, Shanghai 200072, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, 800 Dongchuan RD Minhang Dist, Shanghai 200240, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 115
Keywords
Vision and language representation; Graph neural network; Fashion semantic knowledge; Contrastive learning;
DOI
10.1016/j.cag.2023.07.025
CLC Classification Number
TP31 [Computer Software];
Subject Classification Number
081202 ; 0835 ;
Abstract
Vision and language representation learning has been demonstrated to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which may fail to capture region-level visual features and to represent the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework to achieve a fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph structure from fashion descriptions and then aggregate it with word-level embeddings, which strengthens the fashion semantic knowledge and yields fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning that pulls the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods. © 2023 Elsevier Ltd. All rights reserved.
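The local vision and language contrastive objective described above can be sketched as a symmetric InfoNCE-style loss over matched (region, text) pairs. This is a minimal illustration, not the authors' implementation: the feature extractors, temperature value, and pairing scheme here are assumptions, and the batch diagonal is taken as the positive pair for each garment.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature rows onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched (region, text) pairs.

    region_feats, text_feats: (N, D) arrays where row i of each comes
    from the same garment. Diagonal entries of the similarity matrix
    are positives; every other row in the batch acts as a negative,
    so minimizing the loss pulls matched pairs together and pushes
    mismatched pairs apart. The 0.07 temperature is a common default,
    not a value taken from the paper.
    """
    v = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature          # (N, N) cosine similarities
    labels = np.arange(logits.shape[0])

    def xent(lg):
        # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of vision->text and text->vision directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned features the loss is near zero; shuffling the text rows so pairs no longer match drives it up, which is the behavior the training signal relies on.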
Pages: 216-225 (10 pages)