A Joint Learning Algorithm for Complex-Valued T-F Masks in Deep Learning-Based Single-Channel Speech Enhancement Systems

Cited by: 20
Authors
Lee, Jinkyu [1]
Kang, Hong-Goo [1]
Affiliations
[1] Yonsei Univ, Dept Elect & Elect Engn, Seoul 03722, South Korea
Funding
National Research Foundation of Singapore
Keywords
Single-channel speech enhancement; complex-valued time-frequency mask; exact time-domain reconstruction; spectrogram consistency; SIGNAL ESTIMATION; PHASE; NOISE
DOI
10.1109/TASLP.2019.2910638
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper presents a joint learning algorithm for complex-valued time-frequency (T-F) masks in single-channel speech enhancement systems. Most speech enhancement algorithms operating in a single-channel microphone environment aim to enhance the magnitude component in the T-F domain, while the noisy input phase component is used directly without any processing. Consequently, the mismatch between the processed magnitude and the unprocessed phase degrades the sound quality. To address this issue, a learning method that targets a T-F mask defined in the complex domain has recently been proposed. However, due to the wide dynamic range and irregular spectrogram pattern of the complex-valued T-F mask, the learning process is difficult even with a large-scale deep learning network. Moreover, a learning process that targets the T-F mask itself does not directly minimize the distortion in the spectral or time domains. To address these concerns, we focus on three issues: 1) effective estimation of complex numbers with a wide dynamic range; 2) a learning method that is directly related to speech enhancement performance; and 3) a way to resolve the mismatch between the estimated magnitude and phase spectra. In this study, we propose objective functions that address each of these issues and train the network by minimizing them within a joint learning framework. The evaluation results demonstrate that the proposed learning algorithm achieves significant performance improvements on various objective measures and in a subjective preference listening test.
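For illustration, the Python sketch below (not taken from the paper) shows how a complex-valued T-F mask modifies both magnitude and phase of a noisy STFT, and how an ISTFT/STFT round trip yields the consistent spectrogram implied by exact time-domain reconstruction. The function name enhance_with_complex_mask, the mask inputs mask_real and mask_imag, and the STFT settings are assumptions; the paper's joint objective functions and network architecture are not reproduced here.

```python
# Minimal illustrative sketch (not the authors' implementation): apply an
# estimated complex-valued T-F mask to a noisy STFT and project the result
# back onto a consistent spectrogram via an ISTFT -> STFT round trip.
import numpy as np
from scipy.signal import stft, istft

def enhance_with_complex_mask(noisy, mask_real, mask_imag,
                              fs=16000, nperseg=512, noverlap=256):
    """Hypothetical helper: mask_real/mask_imag stand in for the real and
    imaginary mask outputs of a network and must match the STFT shape."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Complex multiplication modifies both magnitude and phase of the noisy STFT.
    S_hat = (mask_real + 1j * mask_imag) * Y

    # Exact time-domain reconstruction by overlap-add inverse STFT.
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Re-analyzing the reconstructed waveform gives the consistent spectrogram
    # actually realized in the time domain; a spectrogram-consistency term
    # would penalize the gap between S_hat and this projection.
    _, _, S_consistent = stft(s_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat, S_hat, S_consistent
```

Because an arbitrary complex-valued matrix is generally not the STFT of any waveform, the gap between S_hat and S_consistent reflects the kind of magnitude/phase mismatch the abstract refers to.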
Pages: 1098-1109
Page count: 12
Related papers
7 items in total
  • [1] Sun, Linhui; Bu, Yunyi; Li, Pingan; Wu, Zihao. Single-channel speech enhancement based on joint constrained dictionary learning. EURASIP Journal on Audio, Speech, and Music Processing, 2021, 2021(1).
  • [2] Yang, Haemin; Choe, Soyeon; Kim, Keulbit; Kang, Hong-Goo. Deep Learning-based Speech Presence Probability Estimation for Noise PSD Estimation in Single-channel Speech Enhancement. 2018 International Conference on Signals and Systems (ICSIGSYS), 2018: 267-270.
  • [3] Roy, Sujan Kumar; Nicolson, Aaron; Paliwal, Kuldip K. Deep Learning with Augmented Kalman Filter for Single-Channel Speech Enhancement. 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020.
  • [4] Lee, Jinkyu; Skoglund, Jan; Shabestary, Turaj; Kang, Hong-Goo. Phase-Sensitive Joint Learning Algorithms for Deep Learning-Based Speech Enhancement. IEEE Signal Processing Letters, 2018, 25(8): 1276-1280.
  • [5] Zhang, Long; Bao, Guangzhao; Zhang, Jing; Ye, Zhongfu. Supervised single-channel speech enhancement using ratio mask with joint dictionary learning. Speech Communication, 2016, 82: 38-52.
  • [6] Tu, Yan-Hui; Tashev, Ivan; Zarar, Shuayb; Lee, Chin-Hui. A Hybrid Approach to Combining Conventional and Deep Learning Techniques for Single-Channel Speech Enhancement and Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 2531-2535.
  • [7] Lee, Geon Woo; Kim, Hong Kook. Multi-Task Learning U-Net for Single-Channel Speech Enhancement and Mask-Based Voice Activity Detection. Applied Sciences-Basel, 2020, 10(9).