Acoustic to articulatory mapping with deep neural network

Cited by: 20
Authors
Wu, Zhiyong [1 ,2 ,4 ,5 ]
Zhao, Kai [1 ,2 ,4 ,5 ]
Wu, Xixin [1 ,2 ,3 ,4 ,5 ]
Lan, Xinyu [1 ,2 ,4 ,5 ]
Meng, Helen [1 ,2 ,3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Grad Sch Shenzhen, Shenzhen 518055, Peoples R China
[2] Tsinghua Univ, Shenzhen Key Lab Informat Sci & Technol, Grad Sch Shenzhen, Shenzhen 518055, Peoples R China
[3] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[4] Tsinghua Univ, TNList, Shenzhen 518055, Peoples R China
[5] Tsinghua Univ, Dept Comp Sci & Technol, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Acoustic to articulatory mapping; Audio-visual mapping; Deep neural network (DNN); Speech driven talking avatar; MOVEMENTS;
DOI
10.1007/s11042-014-2183-z
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Synthetic talking avatars have been shown to be very useful in human-computer interaction. In this paper, we address the problem of acoustic-to-articulatory mapping and explore several kinds of models to describe the mapping function: the general linear model (GLM), the Gaussian mixture model (GMM), the artificial neural network (ANN), and the deep neural network (DNN). Taking advantage of the fact that a neural network's prediction stage can be completed in a very short time (i.e., in real time), we develop a real-time speech-driven talking avatar system based on a DNN. The system takes acoustic speech as input and outputs articulatory movements, synchronized with the input speech, on a three-dimensional avatar. Several experiments compare the performance of the GLM, GMM, ANN, and DNN on MNGU0, a well-known acoustic-articulatory English speech corpus. The results demonstrate that the proposed DNN-based acoustic-to-articulatory mapping method achieves the best performance.
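The abstract describes a feed-forward DNN that regresses articulatory movements from acoustic speech. The record here includes no code, so the following is a minimal PyTorch sketch of such a mapping, assuming MFCC frames with a symmetric context window as input and EMA-style articulator coordinates (as in MNGU0) as the regression target; the feature dimensions, layer sizes, and optimizer settings are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of a DNN acoustic-to-articulatory mapping.
# All dimensions and hyperparameters below are assumptions for
# illustration, not the configuration reported in the paper.
import torch
import torch.nn as nn

CONTEXT = 5           # assumed: +/-5 frames of acoustic context
N_MFCC = 13           # assumed: 13 MFCCs per frame
N_EMA = 12            # assumed: 6 EMA coils x (x, y) coordinates
IN_DIM = (2 * CONTEXT + 1) * N_MFCC


class AcousticToArticulatoryDNN(nn.Module):
    """Feed-forward DNN: windowed acoustic frames -> articulator positions."""

    def __init__(self, in_dim=IN_DIM, hidden=512, out_dim=N_EMA):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),  # linear output for regression
        )

    def forward(self, x):
        return self.net(x)


def train_step(model, optimizer, acoustics, articulations):
    """One gradient step minimizing mean squared error; RMSE is the
    usual evaluation metric for acoustic-to-articulatory inversion."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(acoustics), articulations)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = AcousticToArticulatoryDNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Placeholder batch standing in for parallel acoustic/EMA data
    # such as MNGU0 (random tensors here; real features in practice).
    x = torch.randn(32, IN_DIM)
    y = torch.randn(32, N_EMA)
    print("loss:", train_step(model, opt, x, y))
```

Because prediction is a single forward pass per frame, a model of this shape can run faster than real time, which is the property the paper exploits for the speech-driven talking avatar system.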
Pages: 9889-9907
Page count: 19