BEST OF BOTH WORLDS: MULTI-TASK AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION AND ACTIVE SPEAKER DETECTION

Cited by: 6

Authors:

Braga, Otavio [1]

Siohan, Olivier [1]

Affiliations:

[1] Google Inc, Mountain View, CA 94043 USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Audio-visual automatic speech recognition; active speaker detection; speaker diarization; multi-task learning

DOI:

10.1109/ICASSP43922.2022.9746036

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training, we reduce the ASD classification error by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
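The abstract names two ingredients, an attention mechanism over competing face tracks and a multi-task loss, without giving their exact form. The following is a minimal sketch of how such a combination might look for a single time step, not the authors' implementation. The dot-product scoring, the weighted-sum loss, the default weight of 0.5, and all function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_tracks(audio_query, track_features):
    """Soft-select among candidate face tracks for one time step.

    Assumption: each track is scored by a dot product between the audio
    query and that track's visual feature vector.
    """
    scores = [
        sum(a * v for a, v in zip(audio_query, feat))
        for feat in track_features
    ]
    weights = softmax(scores)
    dim = len(track_features[0])
    # Attention-weighted average of the visual features, which would be
    # passed on to the speech recognizer.
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, track_features))
        for d in range(dim)
    ]
    return weights, attended


def asd_cross_entropy(weights, active_index):
    """Cross-entropy of the attention weights against the true speaker."""
    return -math.log(weights[active_index])


def multitask_loss(asr_loss, asd_loss, asd_weight=0.5):
    """Hypothetical combination: ASR loss plus a weighted ASD loss."""
    return asr_loss + asd_weight * asd_loss


# Two candidate faces; the audio query aligns with the first one.
weights, attended = attend_over_tracks([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
loss = multitask_loss(asr_loss=1.0, asd_loss=asd_cross_entropy(weights, 0))
```

In this sketch the attention weights double as the ASD prediction, so supervising them with speaker labels trains both tasks through the same parameters.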


Pages: 6047-6051

Number of pages: 5

