Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition

Cited by: 18
Authors
Xu, Shilin [1,3]
Li, Xiangtai [1,3]
Wang, Jingbo [2]
Cheng, Guangliang [3]
Tong, Yunhai [1]
Tao, Dacheng [4]
Affiliations
[1] Peking Univ, Sch Artificial Intelligence, Key Lab Machine Percept, MOE, Beijing, Peoples R China
[2] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[3] SenseTime Res, Beijing, Peoples R China
[4] Univ Sydney, Sydney, NSW, Australia
Source
COMPUTER VISION, ECCV 2022, PT XXXVII | 2022 / Vol. 13697
Keywords
Human fashion; Fine-grained attribute analysis; Segmentation; Vision transformer
DOI
10.1007/978-3-031-19836-6_31
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Human fashion understanding is a crucial computer vision task, since it provides comprehensive information for real-world applications. This work focuses on joint human fashion segmentation and attribute recognition. In contrast to previous works that separately model each task as a multi-head prediction problem, our insight is to bridge these two tasks with one unified model via vision transformer modeling, so that each task benefits the other. In particular, we introduce the object query for segmentation and the attribute query for attribute prediction. Both queries and their corresponding features can be linked via mask prediction. Then we adopt a two-stream query learning framework to learn the decoupled query representations. We design a novel Multi-Layer Rendering module for the attribute stream to explore more fine-grained features. The decoder design shares the same spirit as DETR. Thus we name the proposed method Fashionformer. Extensive experiments on three human fashion datasets illustrate the effectiveness of our approach. In particular, with the same backbone, our method achieves a relative 10% improvement over previous works on a joint metric (AP_mask^{IoU+F1}) for both segmentation and attribute recognition. To the best of our knowledge, ours is the first unified end-to-end vision transformer framework for human fashion analysis. We hope this simple yet effective method can serve as a new flexible baseline for fashion analysis.
Pages: 545-563
Page count: 19
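As a concrete illustration of the two-stream query design summarized in the abstract, the sketch below shows one decoder step in which object queries drive mask prediction and attribute queries reuse those masks to pool fine-grained features. This is a minimal sketch under our own assumptions, not the authors' released code: the module name TwoStreamDecoderLayer, the mask-weighted pooling used as a stand-in for the Multi-Layer Rendering module, and all sizes are hypothetical.

```python
# Minimal sketch of the two-stream query idea (NOT the authors' implementation).
# Module names, the pooling scheme, and all hyper-parameters are assumptions.
import torch
import torch.nn as nn


class TwoStreamDecoderLayer(nn.Module):
    """One decoder step: object queries predict masks; attribute queries
    reuse the same masks to pool per-object features for attribute logits."""

    def __init__(self, dim=256, num_heads=8, num_classes=46, num_attrs=294):
        super().__init__()
        self.obj_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attr_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_embed = nn.Linear(dim, dim)   # object query -> mask kernel
        self.cls_head = nn.Linear(dim, num_classes)
        self.attr_head = nn.Linear(dim, num_attrs)

    def forward(self, obj_q, attr_q, feat):
        # feat: (B, H*W, dim) flattened pixel features from the encoder.
        obj_q = obj_q + self.obj_attn(obj_q, feat, feat)[0]

        # Masks link the two streams: each object query renders one mask.
        kernels = self.mask_embed(obj_q)                     # (B, N, dim)
        masks = torch.einsum("bnd,bld->bnl", kernels, feat)  # (B, N, H*W)

        # Attribute stream: mask-weighted feature pooling, a simplified
        # stand-in for the multi-scale Multi-Layer Rendering module.
        weights = masks.sigmoid()
        pooled = torch.einsum("bnl,bld->bnd", weights, feat)
        pooled = pooled / weights.sum(-1, keepdim=True).clamp(min=1e-6)
        attr_q = attr_q + self.attr_attn(attr_q, pooled, pooled)[0]

        return obj_q, attr_q, self.cls_head(obj_q), self.attr_head(attr_q), masks


# Toy usage: 100 queries over a 32x32 feature map.
if __name__ == "__main__":
    B, N, D, H, W = 2, 100, 256, 32, 32
    layer = TwoStreamDecoderLayer(dim=D)
    obj_q, attr_q = torch.randn(B, N, D), torch.randn(B, N, D)
    feat = torch.randn(B, H * W, D)
    obj_q, attr_q, cls_logits, attr_logits, masks = layer(obj_q, attr_q, feat)
    print(cls_logits.shape, attr_logits.shape, masks.shape)
```

In this reading of the abstract, the mask prediction is the only coupling point between the two streams, which is what lets the decoupled object and attribute query representations be learned jointly in one model.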