Estimating the user's state before exchanging utterances using intermediate acoustic features for spoken dialog systems
Authors:
Chiba, Yuya [1]
Nose, Takashi [1]
Ito, Masashi [2]
Ito, Akinori [1]
Affiliations:
[1] Graduate School of Engineering, Tohoku University, 6-6-05, Aramaki Aza Aoba, Aoba-ku, Sendai, 980-8579, Japan
[2] Tohoku Institute of Technology, 35-1 Yagiyama-Kasumicho, Taihaku-ku, Sendai-shi, Miyagi, 982-8577, Japan
Abstract:
A spoken dialog system (SDS) is a speech interface that has been incorporated into many devices to help users operate them. An SDS benefits the user because it does not restrict the style of the user's input utterances, but this freedom sometimes makes it difficult for the user to decide what to say to the system. Conventional systems cannot give appropriate help to a user who makes no explicit input utterance, since they must recognize and parse an utterance in order to decide the next prompt. The system should therefore estimate the state of a user who has encountered a problem, so that it can initiate the dialog and provide appropriate help before the user abandons the interaction. Based on this idea, we aim to construct a system that responds to a user who does not speak to the system. In this research, we define two basic states of a user who does not speak: the user is embarrassed by the prompt, or the user is thinking about how to answer it. We discriminate these states using intermediate acoustic features and the facial orientation of the user. Our previous approach relied on several intermediate acoustic features determined manually, so the user's state could not be discriminated automatically. The present paper therefore examines a method for extracting intermediate acoustic features from low-level features such as MFCC, log F0, and zero cross counting (ZCC). We introduce a new annotation rule and compare the discrimination performance with that of the previous feature set. Finally, the user's state is discriminated using the combination of intermediate acoustic features and facial orientation.
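As a concrete illustration (not taken from the paper), the low-level features named above can be computed per frame with an off-the-shelf toolkit. The Python sketch below uses librosa; the frame settings, the pYIN-based F0 tracker, the use of a per-frame zero-crossing rate as a stand-in for ZCC, and the function name low_level_features are all illustrative assumptions rather than details of the authors' method.

    # A minimal sketch, assuming librosa is available and 16 kHz input audio.
    # It stacks the low-level features named in the abstract (MFCC, log F0,
    # zero crossings) into one per-frame feature matrix.
    import numpy as np
    import librosa

    def low_level_features(wav_path, sr=16000, frame_length=1024, hop_length=256):
        y, sr = librosa.load(wav_path, sr=sr)

        # 13-dimensional MFCCs per analysis frame.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=frame_length, hop_length=hop_length)

        # F0 via pYIN; unvoiced frames are returned as NaN.
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                fmax=librosa.note_to_hz('C6'), sr=sr,
                                frame_length=frame_length, hop_length=hop_length)
        log_f0 = np.log(np.where(np.isnan(f0), 1.0, f0))  # log F0, 0 when unvoiced

        # Per-frame zero-crossing rate (an assumed stand-in for ZCC).
        zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                                 hop_length=hop_length)

        # Align frame counts and stack into an (n_frames, 15) matrix:
        # 13 MFCC dims + log F0 + zero-crossing rate.
        n = min(mfcc.shape[1], len(log_f0), zcr.shape[1])
        return np.vstack([mfcc[:, :n], log_f0[None, :n], zcr[:, :n]]).T

In a setup like the one the abstract describes, such per-frame vectors would feed an intermediate-feature extractor, whose output would then be combined with facial-orientation information to classify the user's state; that pipeline is only outlined here, not implemented.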