Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition

Cited: 0
Authors
Su, Rongfeng [1 ]
Wang, Lan [1 ]
Liu, Xunying [2 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, CAS Key Lab Human Machine Intelligence Synergy Sy, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China
Source
2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017
Funding
National Natural Science Foundation of China
Keywords
audio-visual speech recognition; multimodal learning; visual feature generation; LSTM; NETWORKS;
DOI
None
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, various audio-visual speech recognition (AVSR) systems have been developed using multimodal learning techniques. One key issue is that most of them are based on 2D audio-visual (AV) corpora with low video sampling rates. To address this issue, this paper introduces a 3D AV data set with a higher video sampling rate (up to 100 Hz). Another issue is the requirement for both auditory and visual modalities during system testing. To address this issue, a visual feature generation based bimodal convolutional neural network (CNN) framework is proposed to build an AVSR system with wider applicability. In this framework, a long short-term memory recurrent neural network (LSTM-RNN) is used to generate the visual modality from the auditory modality, while CNNs are used to integrate the two modalities. On a Mandarin Chinese far-field speech recognition task, when the visual modality is provided, a significant average character error rate (CER) reduction of about 27% relative was obtained over the audio-only CNN baseline. When the visual modality is not available, the proposed AVSR system using the visual feature generation technique outperformed the audio-only CNN baseline by 18.52% relative CER.
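The pipeline the abstract describes (a recurrent generator mapping the audio stream to pseudo-visual features, then a CNN fusing the two modalities) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: a simple tanh recurrence stands in for the trained LSTM-RNN generator, a single 1-D convolution over time stands in for the bimodal CNN, and all dimensions (40-dim audio, 30-dim visual, kernel width 5, 64 channels) are illustrative assumptions.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): 100 frames per utterance,
# 40-dim audio features, 30-dim visual features.
T, A_DIM, V_DIM = 100, 40, 30

rng = np.random.default_rng(0)
audio = rng.standard_normal((T, A_DIM))

# Stand-in for the LSTM-RNN generator: a simple recurrent map from the
# audio stream to pseudo-visual features (the paper trains an LSTM-RNN).
W_in = rng.standard_normal((A_DIM, V_DIM)) * 0.1
W_rec = rng.standard_normal((V_DIM, V_DIM)) * 0.1

def generate_visual(audio_frames):
    h = np.zeros(V_DIM)
    out = []
    for x in audio_frames:
        h = np.tanh(x @ W_in + h @ W_rec)
        out.append(h)
    return np.stack(out)          # shape (T, V_DIM)

visual = generate_visual(audio)   # used when real video is unavailable

# Bimodal integration: concatenate per-frame audio and (real or generated)
# visual features, then apply a 1-D convolution over time, mimicking the
# CNN front end that fuses the two modalities.
fused = np.concatenate([audio, visual], axis=1)   # (T, A_DIM + V_DIM)

K, OUT = 5, 64                                    # kernel width, channels
kernel = rng.standard_normal((K, A_DIM + V_DIM, OUT)) * 0.01

def conv1d_time(x, k):
    t_out = x.shape[0] - k.shape[0] + 1
    return np.stack([np.einsum('kc,kco->o', x[t:t + k.shape[0]], k)
                     for t in range(t_out)])

features = np.maximum(conv1d_time(fused, kernel), 0.0)  # ReLU
print(features.shape)  # (96, 64)
```

In the paper's "audio-only at test time" condition, the generated `visual` features replace the missing video stream, which is why the same fused front end can be applied with or without a camera.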
Pages: 40-43 (4 pages)
Cited References
23 records
[1] Abdel-Hamid, Ossama; Mohamed, Abdel-Rahman; Jiang, Hui; Deng, Li; Penn, Gerald; Yu, Dong. Convolutional Neural Networks for Speech Recognition. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2014, 22(10): 1533-1545.
[2] [Anonymous], 2016, P 2016 10 INT S CHIN
[3] [Anonymous], 2009, HTK BOOK VERSION 3 4
[4] Dodd, B., 1987, Hearing by Eye: The Psychology of Lip-Reading
[5] Dupont, Stephane; Luettin, Juergen. Audio-Visual Speech Modeling for Continuous Speech Recognition. IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2(3): 141-151.
[6] Gravier, G., 2002, HUMAN LANGUAGE TECHN, P1
[7] Hermans, M., 2013, Adv. Neural Inf. Process. Syst., V26
[8] Hinton, Geoffrey E.; Osindero, Simon; Teh, Yee-Whye. A Fast Learning Algorithm for Deep Belief Nets. NEURAL COMPUTATION, 2006, 18(7): 1527-1554.
[9] Huang, J., 2013, INT CONF ACOUST SPEE, P7596, DOI 10.1109/ICASSP.2013.6639140
[10] Luettin, J., 2001, INT CONF ACOUST SPEE, P169, DOI 10.1109/ICASSP.2001.940794