Detection and Separation of Speech Event Using Audio and Video Information Fusion and Its Application to Robust Speech Interface
Cited by: 0
Authors:
Futoshi Asano
Kiyoshi Yamamoto
Isao Hara
Jun Ogata
Takashi Yoshimura
Yoichi Motomura
Naoyuki Ichimura
Hideki Asoh
Affiliations:
[1] National Institute of Advanced Industrial Science and Technology, Information Technology Research Institute
[2] Tsukuba University, Department of Computer Science
Source:
EURASIP Journal on Advances in Signal Processing, Volume 2004
Keywords:
information fusion; sound localization; human tracking; adaptive beamformer; speech recognition
DOI: not available
Abstract:
A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, information on the time and location of speech events can be obtained. This information on the detected speech events is then utilized in a robust speech interface. A maximum-likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise. The coefficients of the beamformer are updated based on the information on the detected speech events. The information on the speech events is also used by the speech recognizer to extract the speech segments.
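As a rough illustration of the pipeline described in the abstract, the sketch below fuses an audio localization score and a video tracking score with a naive Bayesian combination and, when a speech event is detected, computes minimum-variance (maximum-likelihood under Gaussian noise) beamformer coefficients for the detected direction. This is not the authors' implementation: the conditional-independence assumption, the array geometry, the noise covariance, and all numerical values are illustrative assumptions only.

# Hypothetical sketch: Bayesian fusion of audio/video evidence for speech-event
# detection, followed by minimum-variance (ML) beamformer steering.
import numpy as np

def fuse_speech_event(p_audio, p_video, prior=0.5):
    """Posterior P(speech event | audio, video), assuming the two observations
    are conditionally independent given the event (naive Bayes fusion)."""
    num = prior * p_audio * p_video
    den = num + (1.0 - prior) * (1.0 - p_audio) * (1.0 - p_video)
    return num / den

def steering_vector(mic_positions, doa_deg, freq, c=343.0):
    """Far-field steering vector for a planar array and a given direction of arrival."""
    doa = np.deg2rad(doa_deg)
    direction = np.array([np.cos(doa), np.sin(doa)])
    delays = mic_positions @ direction / c          # per-microphone delay in seconds
    return np.exp(-2j * np.pi * freq * delays)

def mvdr_weights(noise_cov, a):
    """Minimum-variance beamformer coefficients w = R^-1 a / (a^H R^-1 a)."""
    r_inv_a = np.linalg.solve(noise_cov, a)
    return r_inv_a / (a.conj() @ r_inv_a)

# Toy usage with an assumed 4-microphone linear array (positions in meters).
mics = np.array([[0.00, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])
posterior = fuse_speech_event(p_audio=0.8, p_video=0.9)
if posterior > 0.5:                                  # speech event detected
    a = steering_vector(mics, doa_deg=30.0, freq=1000.0)
    R_noise = np.eye(len(mics)) + 0.1 * np.ones((len(mics), len(mics)))  # assumed noise covariance
    w = mvdr_weights(R_noise, a)
    # In a full system, w would be applied per frequency bin to the microphone
    # signals, and the separated output passed to the speech recognizer.
    print(f"P(speech event) = {posterior:.2f}, |w| = {np.abs(w)}")

In the paper's setting, the event posterior would be updated over time from the Bayesian network, and the beamformer coefficients would be recomputed only while a speech event from the detected location is active; the snippet above shows a single frequency and a single decision for brevity.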