A ROBUST AND REAL-TIME VISUAL SPEECH RECOGNITION FOR SMARTPHONE APPLICATION

Cited by: 0

Authors:

Song, Min Gyu [1]

Tariquzzamani, Md [1]

Kim, Jin Young [1]

Hwang, Seong Taek [2]

Chi, Seung Ho [3]

Affiliations:

[1] Chonnam Natl Univ, Sch Elect & Comp Engn, Kwangju 500757, South Korea

[2] Samsung Elect, Multimedia Lab, IT Ctr, Commun Res Ctr, Suwon 442600, South Korea

[3] Dongshin Univ, Informat Ctr, Dept Comp Sci, Naju 520714, Chonnam, South Korea

Source:

INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL | 2012 / Vol. 8 / No. 04

Keywords:

Visual speech recognition; Lip localization; K-means clustering; Histogram matching; Lip folding; RASTA filter; Feature extraction

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual speech recognition (VSR) is a promising complementary approach to speech recognition in very noisy environments, especially in mobile phone use. When implementing VSR on a smartphone, the two main requirements, real-time responsiveness and robustness, conflict with each other. In this paper, we proposed and implemented a robust VSR system that runs in real time. First, we devised a robust and fast lip detection method based on eye detection that is not vulnerable to changes in illumination. The pair of eyes was located using image binarization and a coupled-eye validation method; the lip region was then estimated by geometric lip candidate detection and k-means clustering. Second, to cope with the dependence of VSR performance on lighting, we combined the previously published methods of lip folding and RASTA filtering and introduced a modified histogram equalization, in which a mapping function is calculated for the first frame and then held fixed for the following frames. Third, the VSR system, with a vocabulary of 32 control words, was implemented on a smartphone with code optimization. It was shown to work in real time with promising results.
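The modified histogram equalization mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that the mapping function is computed on the first frame and then kept fixed for later frames; everything else here is an assumption made for illustration: the standard CDF-based equalization formula, 8-bit grayscale frames stored as nested lists, and the hypothetical function names `equalization_mapping`, `apply_mapping`, and `equalize_sequence`. It is not the authors' implementation.

```python
from typing import List

# Assumption: a frame is an 8-bit grayscale image stored as rows of ints in 0..255.
Frame = List[List[int]]


def equalization_mapping(frame: Frame, levels: int = 256) -> List[int]:
    """Build a standard histogram-equalization lookup table from one frame."""
    hist = [0] * levels
    for row in frame:
        for px in row:
            hist[px] += 1

    # Cumulative distribution of pixel intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Flat image: equalization is undefined, so fall back to the identity map.
        return list(range(levels))

    scale = (levels - 1) / (total - cdf_min)
    # Intensities below the darkest observed pixel are clamped to 0.
    return [max(0, round((c - cdf_min) * scale)) for c in cdf]


def apply_mapping(frame: Frame, lut: List[int]) -> Frame:
    """Remap every pixel of a frame through the lookup table."""
    return [[lut[px] for px in row] for row in frame]


def equalize_sequence(frames: List[Frame]) -> List[Frame]:
    """Compute the mapping on the first frame only and reuse it for all frames."""
    if not frames:
        return []
    lut = equalization_mapping(frames[0])
    return [apply_mapping(frame, lut) for frame in frames]
```

Under these assumptions, fixing the lookup table after the first frame costs one table lookup per pixel for every later frame. It also keeps the intensity mapping consistent across an utterance, so frame-to-frame changes in the equalized images come from the lips moving, not from a mapping that is re-estimated on each frame.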


Pages: 2837-2853

Number of pages: 17
