Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN

Cited by: 0
Authors
Zhu Z. [1,2]
Luo C. [2]
He Q. [1]
Peng W. [2]
Mao Z. [2]
Zhang S. [3]
Affiliations
[1] Audio, Speech and Vision Processing Laboratory, South China University of Technology, Guangzhou, Guangdong
[2] School of Cyber Security, Guangdong Polytechnic Normal University, Guangzhou, Guangdong
[3] Guangzhou Quwan Network Technology Co., Ltd., Guangzhou, Guangdong
Source
Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science) | 2023, Vol. 51, No. 05
Funding
National Natural Science Foundation of China
Keywords
consistency judgment; convolutional neural network; frontal reconstruction; generative adversarial network; multi-modal
DOI
10.12141/j.issn.1000-565X.220435
Abstract
Traditional methods for judging the consistency of lip motion and voice mainly process frontal lip motion video, without considering the impact of viewing-angle changes during video acquisition, and they are also prone to ignoring the spatio-temporal characteristics of the lip movement process. To address these problems, this paper focused on the influence of lip-angle changes on consistency judgment, exploited the strengths of three-dimensional convolutional neural networks in non-linear representation and spatio-temporal feature extraction, and proposed a multi-view lip motion and voice consistency judgment method based on frontal lip reconstruction and a three-dimensional (3D) coupled convolutional neural network. Firstly, a self-mapping loss was introduced into the generator to improve frontal reconstruction, and the resulting lip reconstruction method, based on a self-mapping supervised cycle-consistent generative adversarial network (SMS-CycleGAN), was used for angle classification and frontal reconstruction of multi-view lip images. Secondly, two heterogeneous 3D convolutional neural networks were designed to describe the audio and video signals respectively, from which 3D convolution features containing long-term spatio-temporal correlation information were extracted. Finally, a contrastive loss function was introduced as the correlation discrimination measure for audio-video matching, and the outputs of the audio and video networks were coupled into the same representation space for consistency judgment. The experimental results show that the proposed method reconstructs frontal lip images of higher quality and outperforms a variety of comparison methods in consistency judgment. © 2023 South China University of Technology. All rights reserved.
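The record does not spell out the form of the self-mapping loss. As a rough illustration only, the sketch below assumes it acts as an identity-style term in the CycleGAN generator objective: a generator that maps side-view lips to frontal views should leave an already-frontal image unchanged. The names `G_front` and `lambda_self` and the choice of L1 distance are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_mapping_loss(generator, frontal_batch):
    """Identity-style term: feeding an already-frontal lip image to the
    frontalizing generator should reproduce it unchanged (an assumed
    interpretation of the paper's self-mapping supervision)."""
    return F.l1_loss(generator(frontal_batch), frontal_batch)

# Hypothetical combination with the usual CycleGAN objective:
# total = adv_loss + lambda_cyc * cycle_loss \
#         + lambda_self * self_mapping_loss(G_front, x_front)
```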
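Likewise, the abstract only states that two heterogeneous 3D CNNs describe the video and audio streams; the sketch below shows one plausible minimal realization in PyTorch, where every layer size, kernel shape, and input layout is an illustrative assumption rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class VideoBranch3D(nn.Module):
    # Video input: (batch, 1, frames, height, width) grayscale lip crops.
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # pool spatially, keep time
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

class AudioBranch3D(nn.Module):
    # Audio input: (batch, 1, segments, mel_bins, steps), i.e. a spectrogram
    # chopped into short time segments so that 3D convolution applies.
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.Conv3d(16, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))
```

The two branches are "heterogeneous" in the sense that their convolutional stacks differ while both project into a shared embedding dimension, so their outputs can be compared directly in the coupled representation space.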
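Finally, a common reading of "contrastive loss as the correlation discrimination measure" is the classic pairwise contrastive loss of Hadsell et al., applied to the coupled audio and video embeddings; the margin value and Euclidean distance below are assumptions, as is the thresholding rule in the usage note.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, audio_emb, match, margin=1.0):
    """match: float tensor of shape (batch,), 1.0 for consistent
    audio-video pairs, 0.0 for mismatched ones. Pulls matching
    embeddings together and pushes mismatched ones at least
    `margin` apart in the shared representation space."""
    dist = F.pairwise_distance(video_emb, audio_emb)
    pos = match * dist.pow(2)
    neg = (1.0 - match) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

# At test time, consistency could be judged by thresholding the distance:
# consistent = F.pairwise_distance(v_emb, a_emb) < threshold
```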
Pages: 70-77
Number of pages: 7