RECURRENT NEURAL NETWORK TRANSDUCER FOR AUDIO-VISUAL SPEECH RECOGNITION

Cited by: 0
Authors
Makino, Takaki [1]
Liao, Hank [1]
Assael, Yannis [2]
Shillingford, Brendan [2]
Garcia, Basilio [1]
Braga, Otavio [1]
Siohan, Olivier [1]
Affiliations
[1] Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA
[2] DeepMind, 6 Pancras Sq, London N1C 4AG, England
Source
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019
Keywords
Audio-visual speech recognition; recurrent neural network transducer
DOI
10.1109/asru46091.2019.9004036
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.
Pages: 905-912
Page count: 8