Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques

Cited: 117
Authors
Haeb-Umbach, Reinhold [1 ]
Watanabe, Shinji [2 ,3 ,4 ,5 ,7 ]
Nakatani, Tomohiro [3 ,4 ,6 ]
Bacchiani, Michiel [8 ,9 ,10 ]
Hoffmeister, Bjoern [11 ]
Seltzer, Michael L. [12 ,13 ]
Zen, Heiga [14 ,15 ]
Souden, Mehrez [3 ,16 ,17 ]
Affiliations
[1] Int Speech Commun Assoc, Bonn, Germany
[2] Johns Hopkins Univ, Baltimore, MD USA
[3] Nippon Telegraph & Tel Commun Sci Labs, Kyoto, Japan
[4] Georgia Inst Technol, Atlanta, GA 30332 USA
[5] Mitsubishi Elect Res Labs, Cambridge, MA USA
[6] Nagoya Univ, Nagoya, Aichi, Japan
[7] IEEE, Signal Proc Soc Speech & Language Proc Tech Comm, Piscataway, NJ USA
[8] Google Tokyo, Res Grp, Minato City, Japan
[9] IBM Res, Yorktown Hts, NY USA
[10] AT&T Labs Res, Florham Pk, NJ USA
[11] Amazon, Seattle, WA USA
[12] Facebook, Appl Machine Learning Div, Cambridge, MA USA
[13] Carnegie Mellon Univ, Robust Speech Recognit Grp, Pittsburgh, PA 15213 USA
[14] Google, Mountain View, CA USA
[15] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY USA
[16] Apple Inc, Interact Media Grp, Cupertino, CA 95014 USA
[17] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Microphones; Speech recognition; Speech processing; Loudspeakers; Reverberation; Neural networks; Separation; Dereverberation; Recognition;
DOI
10.1109/MSP.2019.2918706
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification code
0808; 0809;
Abstract
Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered differ markedly from those of many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and high-quality speech synthesis, as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all of these fields, deep learning (DL) has played a critical role.
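As a toy illustration of the microphone-array processing mentioned in the abstract, the sketch below implements a minimal delay-and-sum beamformer: each channel is time-aligned toward an assumed source direction and the channels are averaged, reinforcing the target signal relative to uncorrelated noise. All signal names, sample rates, and delays here are illustrative assumptions, not taken from the article itself.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Minimal delay-and-sum beamformer (integer-sample steering delays).

    signals: (num_mics, num_samples) array of microphone signals.
    delays:  per-microphone arrival delays in samples; each channel is
             shifted back by its delay so all channels line up, then
             the aligned channels are averaged.
    """
    num_mics, _ = signals.shape
    aligned = [np.roll(signals[m], -delays[m]) for m in range(num_mics)]
    return np.sum(aligned, axis=0) / num_mics

# Illustrative setup: a 440 Hz tone reaching mic 1 three samples
# after mic 0 (as if the source were closer to mic 0).
fs, n = 16000, 512
t = np.arange(n) / fs
clean = np.sin(2 * np.pi * 440.0 * t)
mics = np.stack([clean, np.roll(clean, 3)])

# Steering with the correct delays recovers the aligned average.
enhanced = delay_and_sum(mics, delays=[0, 3])
```

In practice, far-field devices use frequency-domain, adaptive variants (e.g., MVDR or neural mask-based beamformers) rather than fixed integer delays, but the alignment-and-average principle is the same.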
Pages: 111-124
Number of pages: 14