Far-Field Automatic Speech Recognition

Cited by: 72
Authors
Haeb-Umbach, Reinhold [1]
Heymann, Jahn [2]
Drude, Lukas [2]
Watanabe, Shinji [3]
Delcroix, Marc [4]
Nakatani, Tomohiro [4]
Affiliations
[1] Paderborn Univ, Dept Commun Engn, D-33098 Paderborn, Germany
[2] Amazon.com Inc, D-52064 Aachen, Germany
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[4] NTT Corp, Kyoto, Japan
Keywords
Speech recognition; Microphones; Speech enhancement; Reverberation; Robustness; Acoustic beamforming; automatic speech recognition (ASR); dereverberation; end-to-end speech recognition; speech enhancement; DEEP NEURAL-NETWORKS; SOURCE SEPARATION; SPEAKER DIARIZATION; ENHANCEMENT; DEREVERBERATION; MICROPHONE; ROBUST; MODEL; ADAPTATION; CHALLENGE;
DOI
10.1109/JPROC.2020.3018668
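Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809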
Abstract
The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase in attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile, it has entered the consumer market with digital home assistants with a spoken language interface being its most prominent application. Speech recorded at a distance is affected by various acoustic distortions, and consequently, quite different processing pipelines have emerged compared with ASR for close-talk speech. A signal enhancement front end for dereverberation, source separation, and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multicondition training and adaptation. We will also describe the so-called end-to-end approach to ASR, which is a new promising architecture that has recently been extended to the far-field scenario. This tutorial article gives an account of the algorithms used to enable accurate speech recognition from a distance, and it will be seen that, although deep learning has a significant share in the technological breakthroughs, a clever combination with traditional signal processing can lead to surprisingly effective solutions.
Pages: 124-148
Page count: 25