Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising

Cited by: 0

Authors:

Fujita, Yoto [1]

Nugraha, Aditya Arie [2]

Di Carlo, Diego [2]

Bando, Yoshiaki [2,3]

Fontaine, Mathieu [2,4]

Yoshii, Kazuyoshi [2,5]

Affiliations:

[1] Kyoto Univ, Grad Sch Informat, Kyoto, Japan

[2] RIKEN, Ctr Adv Intelligence Project AIP, Tokyo, Japan

[3] Natl Inst Adv Ind Sci & Technol, AIRC, Tokyo, Japan

[4] Telecom Paris, LTCI, Paris, France

[5] Kyoto Univ, Grad Sch Engn, Kyoto, Japan

Source:

2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | 2024

Keywords:

speech enhancement; dereverberation; neural beamforming; blind source separation; nonnegative matrix factorization

DOI:

10.1109/APSIPAASC63619.2025.10849318

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is neural beamforming, which can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, degrades drastically under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used to generate pseudo ground-truth data from a mixture. Based on this idea, prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming that is asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR) beamforming, whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
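As an illustration of the generic mask-then-filter step mentioned in the abstract, the sketch below computes a mask-based MVDR beamformer in NumPy. It is not the WPD method proposed in the paper and involves no DNN or run-time adaptation: the speech mask is assumed to be given, and the function name `mask_based_mvdr`, its parameters, and the choice of the reference-channel (Souden-style) MVDR formula are assumptions made for this example only.

```python
import numpy as np


def mask_based_mvdr(mixture, speech_mask, ref_channel=0, eps=1e-6):
    """Hypothetical helper: mask-based MVDR beamforming.

    mixture:     complex STFT of the multichannel mixture, shape (F, T, M)
    speech_mask: real-valued mask in [0, 1], shape (F, T), assumed to come
                 from some mask estimator (e.g. a DNN, not included here)
    Returns the single-channel enhanced STFT, shape (F, T).
    """
    noise_mask = 1.0 - speech_mask
    n_channels = mixture.shape[-1]

    def spatial_covariance(mask):
        # Mask-weighted average of the outer products x x^H, shape (F, M, M)
        num = np.einsum("ft,ftm,ftn->fmn", mask, mixture, mixture.conj())
        return num / (mask.sum(axis=1)[:, None, None] + eps)

    phi_s = spatial_covariance(speech_mask)
    # Diagonal loading keeps the noise covariance invertible
    phi_n = spatial_covariance(noise_mask) + eps * np.eye(n_channels)

    # w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s), u selects ref_channel
    numerator = np.linalg.solve(phi_n, phi_s)  # shape (F, M, M)
    trace = np.trace(numerator, axis1=1, axis2=2)[:, None]
    w = numerator[:, :, ref_channel] / (trace + eps)  # shape (F, M)

    # Apply the filter per frequency bin: y = w^H x
    return np.einsum("fm,ftm->ft", w.conj(), mixture)
```

With a noise-free rank-1 mixture and an all-ones speech mask, the output should approximately reproduce the reference channel, which is a quick sanity check of the distortionless constraint.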


Pages: 6
