TARGET SPEECH EXTRACTION WITH PRE-TRAINED SELF-SUPERVISED LEARNING MODELS

Cited by: 1
Authors
Peng, Junyi [1 ]
Delcroix, Marc [2 ]
Ochiai, Tsubasa [2 ]
Plchot, Oldrich [1 ]
Araki, Shoko [2 ]
Cemocky, Jan [1 ]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, Speech FIT, Brno, Czech Republic
[2] NTT Corp, Chiyoda City, Tokyo, Japan
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Funding
US National Science Foundation;
Keywords
Target speech extraction; pre-trained models; self-supervised learning; feature aggregation; NETWORK;
DOI
10.1109/ICASSP48485.2024.10448315
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker from a mixture, guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state of the art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model, including the SSL model parameters.
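The abstract combines two ingredients: aligning intermediate CNN-encoder features (which sit at coarser time resolutions) with transformer-layer outputs via upsampling, and a SUPERB-style learnable weighted sum over layers. The following NumPy sketch illustrates these two mechanisms only; the shapes, function names, and nearest-neighbor upsampling are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def upsample_time(x, factor):
    # Nearest-neighbor upsampling along the time axis (axis 0),
    # used here to bring a coarse feature up to the target frame rate.
    return np.repeat(x, factor, axis=0)

def aggregate_features(layer_feats, weights):
    # SUPERB-style aggregation: softmax-normalize the learnable
    # per-layer weights, then take the weighted sum of the layers.
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()
    return sum(wi * f for wi, f in zip(w, layer_feats))

# Toy example with hypothetical shapes: one CNN stage at half the
# transformer frame rate, plus three transformer-layer outputs.
T, D = 8, 4
cnn_stage = np.ones((T // 2, D))
transformer_layers = [np.full((T, D), float(i)) for i in range(3)]

# Upsample the CNN feature to the transformer rate, then aggregate
# everything with uniform initial weights (zeros before softmax).
aligned = [upsample_time(cnn_stage, 2)] + transformer_layers
fused = aggregate_features(aligned, np.zeros(len(aligned)))
print(fused.shape)  # (8, 4)
```

With uniform weights the fused output is simply the mean of the aligned layers; training would instead learn which CNN stages and transformer layers matter for extraction.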
Pages: 10421-10425
Page count: 5
Cited references
27 items in total
  • [1] Joint Encoder-Decoder Self-Supervised Pre-training for ASR
    Arunkumar, A.
    Umesh, S.
    [J]. INTERSPEECH 2022, 2022, : 3418 - 3422
  • [2] Baevski A, 2020, ADV NEUR IN, V33
  • [3] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
    Chen, Sanyuan
    Wang, Chengyi
    Chen, Zhengyang
    Wu, Yu
    Liu, Shujie
    Chen, Zhuo
    Li, Jinyu
    Kanda, Naoyuki
    Yoshioka, Takuya
    Xiao, Xiong
    Wu, Jian
    Zhou, Long
    Ren, Shuo
    Qian, Yanmin
    Qian, Yao
    Zeng, Michael
    Yu, Xiangzhan
    Wei, Furu
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1505 - 1518
  • [4] LARGE-SCALE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEAKER VERIFICATION
    Chen, Zhengyang
    Chen, Sanyuan
    Wu, Yu
    Qian, Yao
    Wang, Chengyi
    Liu, Shujie
    Qian, Yanmin
    Zeng, Michael
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6147 - 6151
  • [5] Cosentino J, 2020, Arxiv, DOI arXiv:2005.11262
  • [6] Listen only to me! How well can target speech extraction handle false alarms?
    Delcroix, Marc
    Kinoshita, Keisuke
    Ochiai, Tsubasa
    Zmolikova, Katerina
    Sato, Hiroshi
    Nakatani, Tomohiro
    [J]. INTERSPEECH 2022, 2022, : 216 - 220
  • [7] Delcroix M, 2020, INT CONF ACOUST SPEE, P691, DOI 10.1109/ICASSP40776.2020.9054683
  • [8] SpEx+: A Complete Time Domain Speaker Extraction Network
    Ge, Meng
    Xu, Chenglin
    Wang, Longbiao
    Chng, Eng Siong
    Dang, Jianwu
    Li, Haizhou
    [J]. INTERSPEECH 2020, 2020, : 1406 - 1410
  • [9] DPCCN: DENSELY-CONNECTED PYRAMID COMPLEX CONVOLUTIONAL NETWORK FOR ROBUST SPEECH SEPARATION AND EXTRACTION
    Han, Jiangyu
    Long, Yanhua
    Burget, Lukas
    Cernocky, Jan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7292 - 7296
  • [10] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
    Hsu, Wei-Ning
    Bolte, Benjamin
    Tsai, Yao-Hung Hubert
    Lakhotia, Kushal
    Salakhutdinov, Ruslan
    Mohamed, Abdelrahman
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3451 - 3460