Speech-to-Gesture Generation: A Challenge in Deep Learning Approach with Bi-Directional LSTM

Cited by: 25

Authors:

Takeuchi, Kenta [1]

Hasegawa, Dai [2]

Shirakawa, Shinichi [3]

Kaneko, Naoshi [2]

Sakuta, Hiroshi [2]

Sumi, Kazuhiko [2]

Affiliations:

[1] Aoyama Gakuin Univ, Grad Sch Sci & Engn, Sagamihara, Kanagawa, Japan

[2] Aoyama Gakuin Univ, Coll Sci & Engn, Sagamihara, Kanagawa, Japan

[3] Yokohama Natl Univ, Fac Environm & Informat Sci, Yokohama, Kanagawa, Japan

Source:

PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON HUMAN AGENT INTERACTION (HAI'17) | 2017

Keywords:

Deep Learning; Gesture Generation; Bi-Directional LSTM; Speech Features;

DOI:

10.1145/3125739.3132594

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

In this research, we take a first step in generating motion data for gestures directly from speech features. Such a method can make creating gesture animations for Embodied Conversational Agents much easier. We implemented a model using Bi-Directional LSTM taking phonemic features from speech audio data as input to output time sequence data of rotations of bone joints. We assessed the validity of the predicted gesture motion data by evaluating the final loss value of the network, and evaluating the impressions of the predicted gesture by comparing it with the actual motion data that accompanied the audio data used for input and motion data that accompanied a different audio data. The results showed that the accuracy of the prediction for the LSTM model was better than a simple RNN model. In contrast, the impressions evaluation of the predicted gesture was rated lower than the original and mismatched gestures, although individually some predicted gestures were rated the same degree as the mismatched gestures.
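
The architecture named in the abstract (a bi-directional LSTM mapping per-frame speech features to per-frame joint rotations) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the feature and rotation dimensions, the single linear readout, and all function names below are hypothetical, and the weights are random and untrained. The sketch only shows how the forward and backward passes are combined so that each output frame can depend on both earlier and later speech frames.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(n_in, n_hid, rng):
    """Random weights for one LSTM direction: 4 gate blocks over [input; hidden; bias]."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid + 1)]
            for _ in range(4 * n_hid)]


def lstm_pass(frames, weights, n_hid):
    """Run one LSTM direction over a list of frames; return one hidden state per frame."""
    h = [0.0] * n_hid
    c = [0.0] * n_hid
    outputs = []
    for x in frames:
        z = list(x) + h + [1.0]  # current input, previous hidden state, bias term
        pre = [sum(w * v for w, v in zip(row, z)) for row in weights]
        i = [sigmoid(p) for p in pre[0:n_hid]]               # input gate
        f = [sigmoid(p) for p in pre[n_hid:2 * n_hid]]       # forget gate
        o = [sigmoid(p) for p in pre[2 * n_hid:3 * n_hid]]   # output gate
        g = [math.tanh(p) for p in pre[3 * n_hid:]]          # candidate cell values
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs


def bilstm_predict(frames, fwd, bwd, readout, n_hid):
    """Concatenate forward and backward states per frame, then apply a linear readout."""
    hf = lstm_pass(frames, fwd, n_hid)
    hb = lstm_pass(frames[::-1], bwd, n_hid)[::-1]  # re-align to original time order
    result = []
    for a, b in zip(hf, hb):
        state = a + b + [1.0]
        result.append([sum(w * v for w, v in zip(row, state)) for row in readout])
    return result


# Hypothetical sizes: 26 speech features per frame, 12 rotation values per frame.
N_FEATURES, N_HIDDEN, N_ROTATIONS = 26, 8, 12
rng = random.Random(0)
fwd = init_lstm(N_FEATURES, N_HIDDEN, rng)
bwd = init_lstm(N_FEATURES, N_HIDDEN, rng)
readout = [[rng.uniform(-0.1, 0.1) for _ in range(2 * N_HIDDEN + 1)]
           for _ in range(N_ROTATIONS)]

# Stand-in for real speech features: 50 frames of random values.
speech = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(50)]
motion = bilstm_predict(speech, fwd, bwd, readout, N_HIDDEN)
print(len(motion), len(motion[0]))
```

In practice such a model would be trained against recorded motion data with a regression loss; the sketch omits training entirely and produces one rotation vector per input frame.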

Pages: 365-369

Page count: 5
