Far-Field Automatic Speech Recognition

Cited by: 72

Authors:

Haeb-Umbach, Reinhold [1]

Heymann, Jahn [2]

Drude, Lukas [2]

Watanabe, Shinji [3]

Delcroix, Marc [4]

Nakatani, Tomohiro [4]

Affiliations:

[1] Paderborn Univ, Dept Commun Engn, D-33098 Paderborn, Germany

[2] Amazon.com Inc, D-52064 Aachen, Germany

[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

[4] NTT Corp, Kyoto, Japan

Source:

PROCEEDINGS OF THE IEEE | 2021 / Vol. 109 / No. 02

Keywords:

Speech recognition; Microphones; Speech enhancement; Reverberation; Robustness; Acoustic beamforming; automatic speech recognition (ASR); dereverberation; end-to-end speech recognition; speech enhancement; DEEP NEURAL-NETWORKS; SOURCE SEPARATION; SPEAKER DIARIZATION; ENHANCEMENT; DEREVERBERATION; MICROPHONE; ROBUST; MODEL; ADAPTATION; CHALLENGE;

DOI:

10.1109/JPROC.2020.3018668

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronics and Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase in attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile, it has entered the consumer market with digital home assistants with a spoken language interface being its most prominent application. Speech recorded at a distance is affected by various acoustic distortions, and consequently, quite different processing pipelines have emerged compared with ASR for close-talk speech. A signal enhancement front end for dereverberation, source separation, and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multicondition training and adaptation. We will also describe the so-called end-to-end approach to ASR, which is a new promising architecture that has recently been extended to the far-field scenario. This tutorial article gives an account of the algorithms used to enable accurate speech recognition from a distance, and it will be seen that, although deep learning has a significant share in the technological breakthroughs, a clever combination with traditional signal processing can lead to surprisingly effective solutions.

Pages: 124-148

Page count: 25
