A Voice Trigger System using Keyword and Speaker Recognition for Mobile Devices

Cited by: 12
Authors
Lee, Hyeopwoo [1]
Chang, Sukmoon [2]
Yook, Dongsuk [1]
Kim, Yongserk [3]
Affiliations
[1] Korea Univ, Dept Comp & Commun Engn, Speech Informat Proc Lab, Seoul 136701, South Korea
[2] Penn State Univ, Middletown, PA 17057 USA
[3] Samsung Elect Co Ltd, Acoust Technol Ctr, Suwon 443742, South Korea
Keywords
Voice trigger; keyword recognition; speaker recognition; dynamic time warping; vector quantization; Gaussian mixture model; hidden Markov model; HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; VERIFICATION; IDENTIFICATION; ALGORITHM
DOI
10.1109/TCE.2009.5373813
Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Subject classification codes
0808; 0809
Abstract
Voice activity detection plays an important role in an efficient voice interface between humans and mobile devices, since it can be used as a trigger to activate the automatic speech recognition module of a mobile device. If the input speech signal can be recognized as a predefined magic word coming from a legitimate user, it can be utilized as a trigger. In this paper, we propose a voice trigger system using a keyword-dependent speaker recognition technique. The voice trigger must be able to perform keyword recognition, as well as speaker recognition, without using computationally demanding speech recognizers, so that it can properly trigger a mobile device with low computational power consumption. We propose a template based method and a hidden Markov model (HMM) based method for the voice trigger to solve this problem. The experiments using a Korean word corpus show that the template based method performed 4.1 times faster than the HMM based method. However, the HMM based method reduced the recognition error by 27.8% relative to the template based method. The proposed methods are complementary and can be used selectively depending on the device of interest.
Pages: 2377-2384
Number of pages: 8
Related papers
18 in total
[1]  
Bhattacharyya A.K., 1943, Bull. Calcutta Math. Soc., V35, P99, DOI 10.1038/157869B0
[2]   Support vector machines for speaker and language recognition [J].
Campbell, WM ;
Campbell, JP ;
Reynolds, DA ;
Singer, E ;
Torres-Carrasquillo, PA .
COMPUTER SPEECH AND LANGUAGE, 2006, 20 (2-3) :210-229
[3]  
Chung H, 2006, IEEE T CONSUM ELECTR, V52, P792
[4]   Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold [J].
Davis, A ;
Nordholm, S ;
Togneri, R .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (02) :412-424
[5]   The NIST speaker recognition evaluation - Overview, methodology, systems, results, perspective [J].
Doddington, GR ;
Przybocki, MA ;
Martin, AF ;
Reynolds, DA .
SPEECH COMMUNICATION, 2000, 31 (2-3) :225-254
[6]   Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains [J].
Gauvain, Jean-Luc ;
Lee, Chin-Hui .
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1994, 2 (02) :291-298
[7]   Maximum a posteriori adaptation of the centroid model for speaker verification [J].
Hautamaki, Ville ;
Kinnunen, Tomi ;
Karkkainen, Ismo ;
Saastamoinen, Juhani ;
Tuononen, Marko ;
Franti, Pasi .
IEEE SIGNAL PROCESSING LETTERS, 2008, 15 :162-165
[8]   A Smart Universal Remote Control based on Audio-Visual Device Virtualization [J].
Huang, Hsien-Chao ;
Lin, Ting-Ching ;
Huang, Yueh-Min .
IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2009, 55 (01) :172-178
[9]   Text-independent speaker identification using soft channel selection in home robot environments [J].
Ji, Mikyong ;
Kim, Sungtak ;
Kim, Hoirin ;
Yoon, Ho-Sub .
IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2008, 54 (01) :140-144
[10]   Real-time speaker identification and verification [J].
Kinnunen, T ;
Karpov, E ;
Fränti, P .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (01) :277-288