Task-aware Warping Factors in Mask-based Speech Enhancement

Cited by: 0
Authors
Wang, Qiongqiong [1 ]
Lee, Kong Aik [1 ]
Koshinaka, Takafumi [1 ]
Okabe, Koji [1 ]
Yamamoto, Hitoshi [1 ]
Affiliations
[1] NEC Corp Ltd, Biometr Res Labs, Tokyo, Japan
Source
29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021) | 2021
Keywords
Speech enhancement; time-frequency; mask; deep learning; ASV; ASR;
DOI
Not available
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
This paper proposes the use of two task-aware warping factors in mask-based speech enhancement (SE). One controls the balance between speech maintenance and noise removal in the training phase, while the other controls the degree of enhancement applied for a specific downstream task in the testing phase. Our proposal is based on the observation that SE systems trained to improve speech quality often fail to improve other downstream tasks, such as automatic speaker verification (ASV) and automatic speech recognition (ASR), because the tasks do not share the same objectives. The proposed dual-warping-factor approach is easy to apply to any mask-based SE method, and it allows a single SE base module to handle multiple tasks without task-dependent training. Its effectiveness has been confirmed on the SITW dataset for ASV evaluation and the LibriSpeech test-clean set for ASR and speech-quality evaluations at SNRs of 0-20 dB. We show that different warping values are necessary in the testing phase for a single SE base module to achieve optimal performance with respect to the three tasks. With task-aware warping factors, speech quality improved by an 84.7% PESQ increase, while ASV achieved a 22.4% EER reduction and ASR a 52.2% WER reduction on 0 dB speech. The effectiveness of the task-aware warping factors was also cross-validated on the VoxCeleb-1 test set for ASV and the LibriSpeech dev-clean set for ASR and quality evaluations. The proposed method is highly effective and easy to apply in practice.
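The abstract describes two warping factors: a training-phase factor that balances speech maintenance against noise removal, and a testing-phase factor that controls how aggressively the predicted time-frequency mask is applied for a given downstream task. The NumPy sketch below illustrates both ideas under stated assumptions: the asymmetric alpha-weighted loss and the exponent-style test-time mask warping are illustrative forms chosen for clarity, not the paper's exact formulation, and the function names are invented here.

```python
import numpy as np

def warped_mask_loss(pred_mask, ideal_mask, alpha=0.5):
    """Training-phase warping (illustrative assumption, not the paper's loss).

    alpha weights speech maintenance (penalising under-masking, which
    removes speech) against noise removal (penalising over-masking,
    which keeps residual noise).
    """
    err = pred_mask - ideal_mask
    speech_term = np.mean(np.clip(-err, 0.0, None) ** 2)  # mask too small: speech lost
    noise_term = np.mean(np.clip(err, 0.0, None) ** 2)    # mask too large: noise kept
    return alpha * speech_term + (1.0 - alpha) * noise_term

def apply_warped_mask(noisy_mag, mask, beta=1.0):
    """Testing-phase warping: raise the mask to a task-dependent exponent.

    beta = 0 leaves the noisy magnitude untouched; larger beta enhances
    more aggressively. Different downstream tasks (quality, ASV, ASR)
    would use different beta values at test time.
    """
    return noisy_mag * np.clip(mask, 1e-8, 1.0) ** beta
```

In this sketch, choosing a per-task exponent at test time requires no retraining of the mask estimator, which mirrors the practical appeal the abstract claims for a single SE base module serving multiple tasks.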
Pages: 476-480
Page count: 5