Three-dimensional virtual try-on network based on attention mechanism and vision transformer

Cited: 0
Authors
Yuan T. [1]
Wang X. [1]
Luo W. [1]
Mei C. [1]
Wei J. [1]
Zhong Y. [1,2]
Affiliations
[1] College of Textiles, Donghua University, Shanghai
[2] Key Laboratory of Textile Science and Technology, Ministry of Education, Donghua University, Shanghai
Source
Fangzhi Xuebao/Journal of Textile Research | 2023, Vol. 44, No. 07
Keywords
attention mechanism; depth estimation; three-dimensional reconstruction; virtual try-on; vision transformer;
DOI
10.13475/j.fzxb.20220508401
Abstract
Objective Three-dimensional (3-D) virtual try-on can provide an intuitive and realistic view for online shopping and has great potential commercial value. However, existing 3-D virtual try-on networks suffer from problems such as inaccurately generated 3-D human models, unclear model edges and excessive clothing deformation during virtual fitting, which greatly limit the application of this technology in real scenarios.

Method To solve the above problems, this research proposed T3D-VTON, a deep neural network introducing a convolutional attention mechanism and a vision transformer. The network was designed with three modules: 1) a convolutional block attention module added to the feature extraction module so that the network focuses on key information and suppresses irrelevant information; 2) a depth estimation network with an encoder-decoder structure that builds a multiscale neural network combining ResNet and transformer; 3) a feature fusion module that fuses 2-D and 3-D information to obtain the final 3-D virtual fitting model. The effect of adding the convolutional attention mechanism and the vision transformer module on network performance was investigated in detail, with performance assessed mainly by the virtual fitting results and the accuracy of the generated human body model. Qualitative and quantitative comparisons were conducted against the benchmark network.

Results The quantitative results showed that, compared with the baseline network, the structural similarity index measure (SSIM) was improved by 0.0157 and the peak signal-to-noise ratio (PSNR) by 0.1132, indicating that the image generation quality was improved without much loss of information. In terms of human model generation accuracy, the absolute relative error of depth estimation was reduced by 0.037 and the squared relative error by 0.014 compared with the baseline, indicating that the 3-D human model generated by this network was more accurate and that the predicted depth map was more consistent with the given ground truth. The qualitative results showed that the deformed garment fitted the corresponding region of the target body more closely, without excessive deformation, and produced fewer garment artifacts. When dealing with complex textures, the network better preserved the pattern and material of the garment fabric. The front and side views of the generated 3-D try-on model showed that the body model has clearer contour edges and that adhesion between the arms and the abdomen is effectively eliminated; for example, when the knees are close to each other, the network is able to eliminate the adhesion between them.

Conclusion The convolutional block attention module and vision transformer introduced in the T3D-VTON network preserve the textural patterns and brand logos on the garment surface when dealing with complex textures. The resulting structure effectively regulates the garment deformation so that it blends reasonably with the dressing area of the target person. When generating the 3-D human body model, the network produces clearer edges and has more accurate shape generation capability. The method can finally present a 3-D human body model with richer surface texture and more accurate body shape, providing a fast and economical solution for turning a single image into a 3-D virtual try-on application. © 2023 China Textile Engineering Society. All rights reserved.
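To make the Method description concrete, the following is a minimal sketch of a convolutional block attention module of the kind added to the feature extraction module, assuming a PyTorch implementation; the reduction ratio, kernel size and feature dimensions are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale                     # reweight each channel


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool along the channel axis, then learn a spatial saliency map.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                     # reweight each spatial location


class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))


# Hypothetical usage: refine a feature map from the clothing/person encoder.
features = torch.randn(1, 256, 32, 32)
refined = CBAM(256)(features)
```

Channel attention reweights feature channels from pooled descriptors, and spatial attention then highlights the image regions most relevant to the try-on task, matching the stated goal of focusing on key information and suppressing irrelevant information.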
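The quantitative comparison in the Results section uses SSIM and PSNR for the generated try-on image and absolute/square relative error for the estimated depth map. Below is a minimal sketch of how these metrics are commonly computed, assuming NumPy arrays and scikit-image; the function names and the valid-depth mask are illustrative assumptions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def image_metrics(generated: np.ndarray, target: np.ndarray) -> dict:
    """SSIM and PSNR between generated and target images (H x W x 3, values in [0, 1])."""
    return {
        "SSIM": structural_similarity(generated, target,
                                      channel_axis=-1, data_range=1.0),
        "PSNR": peak_signal_noise_ratio(target, generated, data_range=1.0),
    }


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard depth-estimation errors against the ground-truth depth map."""
    mask = gt > 0                       # ignore invalid (zero-depth) pixels
    pred, gt = pred[mask], gt[mask]
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
    }
```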
Pages: 192-198
Number of pages: 6