An analysis of the effect of combining standard and alternate sensor signals on recognition of syllabic units for multimodal speech recognition

Cited by: 6
Authors
Radha, N. [1 ]
Shahina, A. [1 ]
Prabha, P. [1 ]
Sri, Preethi B. T. [1 ]
Khan, Nayeemulla A. [2 ]
Affiliations
[1] SSN Coll Engn, Dept Informat Technol, Kalavakkam 603110, India
[2] VIT Univ, Sch Comp Sci & Engn, Madras 600127, Tamil Nadu, India
Keywords
Multimodal speech recognition; Throat microphone; Lip reading; Hidden Markov models; COUPLED-HMM; LIP-MOTION;
DOI
10.1016/j.patrec.2017.10.011
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper studies the effect of combining evidence from multiple modes of speech on the recognition of different categories of sounds. Multimodal speech recognition systems are built by combining the acoustic and visual cues from (lip-radiated) normal microphone speech, throat microphone speech, and lip reading for the recognition of the highly confusable 145 consonant-vowel units of the Hindi language. The performance of the multimodal systems is compared with that of the unimodal systems for the recognition of sounds based on their place of articulation (POA) and manner of articulation (MOA), as well as their associated vowels. This comparison shows that although the multimodal ASR systems rely on the presence of complementary speech-related acoustic and visual cues in the different modes, not all evidence is complementary. Bimodal systems that combine visual cues from lip reading are shown to improve the recognition of sounds based on POA and MOA, but to decrease the recognition of vowels. This study shows that, compared to the standard automatic speech recognition (ASR) system, the best multimodal system, which combines the two acoustic cues as well as the visual cue, improves the recognition of the POA category by 11%, the MOA category by 3%, and vowels by 2%. However, the study shows the need to explore better fusion techniques to overcome the absence of complementary evidence in certain categories of sounds, especially in bimodal systems. (C) 2017 Elsevier B.V. All rights reserved.
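The combination of evidence streams described in the abstract is commonly realized as weighted score-level (late) fusion of per-modality classifier scores. The sketch below is illustrative only: the stream weights, scores, and function name are hypothetical and not taken from the paper.

```python
import numpy as np

def fuse_scores(stream_log_likelihoods, weights):
    """Combine per-class log-likelihoods from several modalities.

    stream_log_likelihoods: list of arrays, one per modality,
        each of shape (num_classes,).
    weights: per-modality stream weights (typically summing to 1).
    """
    scores = np.stack(stream_log_likelihoods)   # (num_streams, num_classes)
    w = np.asarray(weights)[:, None]            # broadcast weights over classes
    return (w * scores).sum(axis=0)             # weighted log-likelihood combination

# Hypothetical example: 3 CV classes scored by normal-microphone,
# throat-microphone, and lip-reading streams.
normal = np.log([0.6, 0.3, 0.1])
throat = np.log([0.5, 0.4, 0.1])
lips   = np.log([0.2, 0.7, 0.1])

fused = fuse_scores([normal, throat, lips], weights=[0.5, 0.3, 0.2])
best = int(np.argmax(fused))   # index of the winning class after fusion
```

With these illustrative weights the acoustic streams dominate, so the fused decision follows the normal-microphone evidence; a less reliable stream (e.g. lip reading in this example) can be down-weighted rather than discarded.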
Pages: 39-49 (11 pages)