ROBUST MULTI-CHANNEL SPEECH RECOGNITION USING FREQUENCY ALIGNED NETWORK

Cited by: 0
Authors
Park, Taejin [1]
Kumatani, Kenichi [2]
Wu, Minhua [2]
Sundaram, Shiva [2]
Affiliations
[1] Univ Southern Calif USC, Los Angeles, CA 90007 USA
[2] Amazon Inc, Sunnyvale, CA USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
multi-channel acoustic modeling; beamforming; microphone arrays; automatic speech recognition
DOI
10.1109/icassp40776.2020.9053940
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Conventional speech enhancement techniques such as beamforming have known benefits for far-field speech recognition. Our own work in frequency-domain multi-channel acoustic modeling has shown additional improvements from training a spatial filtering layer jointly within an acoustic model. In this paper, we further develop this idea and use a frequency aligned network for robust multi-channel automatic speech recognition (ASR). Unlike an affine layer in the frequency domain, the proposed frequency aligned component prevents one frequency bin from influencing other frequency bins. We show that this modification not only reduces the number of parameters in the model but also significantly improves ASR performance. We investigate the effects of the frequency aligned network through ASR experiments on real-world far-field data in which users interact with an ASR system in uncontrolled acoustic environments. Our multi-channel acoustic model with a frequency aligned network achieves up to 18% relative reduction in word error rate.
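The contrast the abstract draws between an affine layer and a frequency aligned component can be illustrated with a small sketch. This is not the authors' architecture: the layer sizes, the real-valued inputs (the paper works in the frequency domain, where inputs would typically be complex), and the names `affine_layer`, `frequency_aligned_layer`, `NUM_BINS`, `NUM_CHANNELS`, and `NUM_OUTPUTS` are all assumptions made for illustration. It only shows the general idea that giving each frequency bin its own small weight matrix keeps bins from mixing and uses fewer parameters than one dense matrix over all bins.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
NUM_BINS = 4      # frequency bins
NUM_CHANNELS = 2  # microphones
NUM_OUTPUTS = 3   # outputs per bin


def affine_layer(x_flat, weights):
    """Dense map: every output mixes every bin and every channel."""
    return [sum(w * v for w, v in zip(row, x_flat)) for row in weights]


def frequency_aligned_layer(x_bins, weights_per_bin):
    """Per-bin map: outputs for bin k depend only on bin k's channel inputs."""
    return [
        [sum(w * v for w, v in zip(row, x_k)) for row in w_k]
        for x_k, w_k in zip(x_bins, weights_per_bin)
    ]


rng = random.Random(0)

# Input: one value per (bin, channel); real-valued for simplicity.
x_bins = [[rng.gauss(0, 1) for _ in range(NUM_CHANNELS)] for _ in range(NUM_BINS)]

# Frequency aligned weights: one (NUM_OUTPUTS x NUM_CHANNELS) matrix per bin.
w_bins = [
    [[rng.gauss(0, 1) for _ in range(NUM_CHANNELS)] for _ in range(NUM_OUTPUTS)]
    for _ in range(NUM_BINS)
]

# Dense weights: one matrix over the flattened (bin, channel) input.
dim_in = NUM_BINS * NUM_CHANNELS
dim_out = NUM_BINS * NUM_OUTPUTS
w_full = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]

# Weight counts (biases ignored).
affine_params = dim_in * dim_out
aligned_params = NUM_BINS * NUM_OUTPUTS * NUM_CHANNELS

# Perturb a single input in bin 0 and see which outputs move.
x_perturbed = [row[:] for row in x_bins]
x_perturbed[0][0] += 1.0

y_ref = frequency_aligned_layer(x_bins, w_bins)
y_new = frequency_aligned_layer(x_perturbed, w_bins)
changed_bins = [k for k in range(NUM_BINS) if y_ref[k] != y_new[k]]

flat_ref = [v for row in x_bins for v in row]
flat_new = [v for row in x_perturbed for v in row]
z_ref = affine_layer(flat_ref, w_full)
z_new = affine_layer(flat_new, w_full)
changed_dense_outputs = sum(1 for a, b in zip(z_ref, z_new) if a != b)

print("dense weights:", affine_params, "per-bin weights:", aligned_params)
print("bins affected (per-bin layer):", changed_bins)
print("outputs affected (dense layer):", changed_dense_outputs, "of", dim_out)
```

With these toy sizes the dense layer needs `NUM_BINS` times as many weights as the per-bin layer, and a change confined to one bin's input reaches only that bin's outputs in the per-bin layer while it reaches every output of the dense layer.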
Pages: 6859-6863
Page count: 5