Modality Dropout for Improved Performance-driven Talking Faces

Cited by: 23
Authors
Abdelaziz, Ahmed Hussen [1 ]
Theobald, Barry-John [1 ]
Dixon, Paul [2 ]
Knothe, Reinhard [2 ]
Apostoloff, Nicholas [1 ]
Kajareker, Sachin [1 ]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
[2] Apple, Zurich, Switzerland
Source
PROCEEDINGS OF THE 2020 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2020 | 2020
Keywords
Audio-visual speech synthesis; multimodal processing; facial tracking; blendshape coefficient; 3D talking faces; modality dropout; SPEECH; VIDEO;
DOI
10.1145/3382507.3418840
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-verbal facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real time on resource-limited hardware (e.g., a smartphone), is user agnostic, and does not depend on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Without modality dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.
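To make the training idea concrete, the sketch below shows one plausible way to implement modality dropout at the batch level: with some probability a batch is turned into an audio-only or video-only batch by zeroing out the other modality, and otherwise both streams are kept. This is an illustrative example only; the function name, probability values, and zero-masking strategy are assumptions and are not taken from the paper.

```python
import random
import torch

def apply_modality_dropout(audio_feats: torch.Tensor,
                           video_feats: torch.Tensor,
                           p_audio_only: float = 0.25,
                           p_video_only: float = 0.25):
    """Illustrative (hypothetical) modality dropout for one training batch.

    With probability p_audio_only the video stream is zeroed (audio-only batch),
    with probability p_video_only the audio stream is zeroed (video-only batch),
    and otherwise both modalities are kept (audiovisual batch). The probabilities
    control how strongly the model is pushed to exploit each modality.
    """
    r = random.random()
    if r < p_audio_only:
        video_feats = torch.zeros_like(video_feats)   # audio-only batch
    elif r < p_audio_only + p_video_only:
        audio_feats = torch.zeros_like(audio_feats)   # video-only batch
    # else: keep both streams (audiovisual batch)
    return audio_feats, video_feats

# Example usage with placeholder feature shapes:
audio = torch.randn(8, 100, 40)   # (batch, frames, acoustic feature dim)
video = torch.randn(8, 100, 51)   # (batch, frames, blendshape coefficient dim)
audio, video = apply_modality_dropout(audio, video)
```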
Pages: 378-386
Number of pages: 9