TOWARDS IMMEDIATE BACKCHANNEL GENERATION USING ATTENTION-BASED EARLY PREDICTION MODEL

Cited by: 11
Authors
Adiba, Amalia Istiqlali [1]
Homma, Takeshi [1]
Miyoshi, Toshinori [1]
Affiliations
[1] Res & Dev Grp Hitachi Ltd, Hitachi, Ibaraki, Japan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
spoken dialogue system; backchannel; ASR delay; early loss; attention;
DOI
10.1109/ICASSP39728.2021.9414193
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Backchannel feedback from the spoken dialogue system makes the human-machine interaction more sophisticated. To predict suitable timing and forms, backchannel prediction technology has been studied. Most studies have combined acoustic and lexical features into the model for better prediction. However, extracting lexical features leads to a delay caused by the automatic speech recognition (ASR) process. To make accurate predictions on the basis of delayed ASR outputs, we propose early prediction for backchannel opportunity and backchannel category based on attention-based LSTM mechanisms. The loss is calculated with a weighting value that gradually increases when a sequence is closer to a suitable response timing. The proposed backchannel prediction uses a two-step approach that first detects a backchannel opportunity and then predicts a backchannel category. Evaluation results show that the early prediction model can predict a backchannel opportunity and category better than the current state-of-the-art algorithm even under a 2.0-second ASR delay condition.
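The loss weighting described in the abstract can be illustrated with a minimal sketch. The abstract only states that the weight gradually increases as a sequence approaches a suitable response timing; the exponential ramp, the `decay` parameter, and the function names below are assumptions made for illustration and are not taken from the paper.

```python
import math


def early_loss_weights(num_steps, decay=1.0):
    """Return one weight per time step, growing toward the final step.

    Assumption: the final step stands for the response timing, and the
    ramp is exponential, exp(-decay * steps_remaining). The abstract does
    not specify the actual weighting function.
    """
    return [math.exp(-decay * (num_steps - 1 - t)) for t in range(num_steps)]


def weighted_early_loss(probs_of_true_class, decay=1.0):
    """Weighted sum of per-step negative log-likelihoods.

    probs_of_true_class[t] is the (hypothetical) model probability assigned
    to the correct label at step t; every value must be in (0, 1].
    """
    weights = early_loss_weights(len(probs_of_true_class), decay)
    return sum(w * -math.log(p) for w, p in zip(weights, probs_of_true_class))


if __name__ == "__main__":
    # Errors made early in the sequence are penalized less than errors
    # made close to the response timing.
    print(early_loss_weights(4, decay=0.5))
    print(weighted_early_loss([0.5, 0.6, 0.8, 0.9], decay=0.5))
```

Under this assumed scheme, a larger `decay` concentrates the loss on steps near the response timing, while `decay=0` weights all steps equally.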
Pages: 7408-7412
Number of pages: 5