Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

Cited by: 41
Authors
Shimada, Kazuki [1 ]
Bando, Yoshiaki [1 ]
Mimura, Masato [1 ]
Itoyama, Katsutoshi [1 ]
Yoshii, Kazuyoshi [1 ,2 ]
Kawahara, Tatsuya [1 ]
Affiliations
[1] Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan
[2] RIKEN, Ctr Adv Intelligence Project, Tokyo 1030027, Japan
Keywords
Noisy speech recognition; speech enhancement; multichannel nonnegative matrix factorization; beamforming; CONVOLUTIVE MIXTURES; NEURAL-NETWORKS; SEPARATION; SINGLE; MODEL;
DOI
10.1109/TASLP.2019.2907015
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, minimum variance distortionless response (MVDR) beamforming has been widely used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimate such spatial information, conventional studies take a supervised approach that classifies each time-frequency (TF) bin into noise or speech by training a deep neural network (DNN). The performance of ASR, however, is degraded in unknown noisy environments. To solve this problem, we take an unsupervised approach that decomposes each TF bin into the sum of speech and noise by using multichannel nonnegative matrix factorization (MNMF). This enables us to accurately estimate the SCMs of speech and noise not from the observed noisy mixtures but from the separated speech and noise components. In this paper, we propose online MVDR beamforming that effectively initializes and incrementally updates the parameters of MNMF. Another main contribution is a comprehensive investigation of the ASR performance obtained by various types of spatial filters, i.e., time-invariant and time-variant versions of MVDR beamformers and of rank-1 and full-rank multichannel Wiener filters, in combination with MNMF. The experimental results showed that the proposed method outperformed a state-of-the-art DNN-based beamforming method in unknown environments that did not match the training data.
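To make the filtering step concrete, the sketch below shows how a time-invariant MVDR beamformer and a full-rank multichannel Wiener filter can be computed once speech and noise SCMs are available (e.g., estimated by MNMF, as in the abstract). It is a minimal NumPy illustration, not the authors' implementation: the function names, array shapes, the eps regularization, and the use of the principal eigenvector of the speech SCM as the steering vector are assumptions made here for clarity.

```python
import numpy as np

def mvdr_weights(scm_speech, scm_noise, ref_mic=0, eps=1e-6):
    """Per-frequency MVDR weights w_f = R_n^{-1} h / (h^H R_n^{-1} h).

    scm_speech, scm_noise: (F, M, M) complex Hermitian SCMs.
    Returns w: (F, M) beamformer weights.
    """
    F, M, _ = scm_speech.shape
    w = np.zeros((F, M), dtype=complex)
    for f in range(F):
        # Steering vector assumed to be the principal eigenvector of the speech SCM.
        _, eigvecs = np.linalg.eigh(scm_speech[f])
        h = eigvecs[:, -1]
        # Normalize the phase so the reference microphone component is real.
        h = h * (h[ref_mic].conj() / (abs(h[ref_mic]) + eps))
        Rn = scm_noise[f] + eps * np.eye(M)          # regularize the noise SCM
        Rn_inv_h = np.linalg.solve(Rn, h)            # R_n^{-1} h
        w[f] = Rn_inv_h / (h.conj() @ Rn_inv_h + eps)
    return w

def apply_beamformer(w, X):
    """Apply weights to a multichannel STFT X of shape (F, T, M): y = w^H x."""
    return np.einsum('fm,ftm->ft', w.conj(), X)

def fullrank_mwf(scm_speech, scm_noise, X, ref_mic=0, eps=1e-6):
    """Full-rank multichannel Wiener filter: s_hat = R_s (R_s + R_n)^{-1} x."""
    F, T, M = X.shape
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        W = scm_speech[f] @ np.linalg.inv(scm_speech[f] + scm_noise[f] + eps * np.eye(M))
        Y[f] = (X[f] @ W.T)[:, ref_mic]              # filter each frame, keep reference channel
    return Y
```

In practice, scm_speech and scm_noise would come from the MNMF decomposition of the observed multichannel STFT; the reference-microphone phase normalization and the eps terms are common practical choices rather than part of the paper's formulation.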
Pages: 960-971
Number of pages: 12