TARGET SPEECH EXTRACTION WITH PRE-TRAINED SELF-SUPERVISED LEARNING MODELS

Cited by: 1
Authors
Peng, Junyi [1 ]
Delcroix, Marc [2 ]
Ochiai, Tsubasa [2 ]
Plchot, Oldrich [1 ]
Araki, Shoko [2 ]
Cemocky, Jan [1 ]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, Speech FIT, Brno, Czech Republic
[2] NTT Corp, Chiyoda City, Tokyo, Japan
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Funding
US National Science Foundation;
Keywords
Target speech extraction; pre-trained models; self-supervised learning; feature aggregation; NETWORK;
DOI
10.1109/ICASSP48485.2024.10448315
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker from a mixture, guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state of the art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model, including the SSL model parameters.
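The abstract combines two ingredients: aligning intermediate CNN-encoder features (which sit at coarser time resolutions) with transformer-layer outputs via upsampling, and a SUPERB-style learnable weighted sum over layers. The following NumPy sketch illustrates these two mechanisms only; the shapes, function names, and nearest-neighbor upsampling are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def upsample_time(x, factor):
    # Nearest-neighbor upsampling along the time axis (axis 0),
    # used here to bring a coarse feature up to the target frame rate.
    return np.repeat(x, factor, axis=0)

def aggregate_features(layer_feats, weights):
    # SUPERB-style aggregation: softmax-normalize the learnable
    # per-layer weights, then take the weighted sum of the layers.
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()
    return sum(wi * f for wi, f in zip(w, layer_feats))

# Toy example with hypothetical shapes: one CNN stage at half the
# transformer frame rate, plus three transformer-layer outputs.
T, D = 8, 4
cnn_stage = np.ones((T // 2, D))
transformer_layers = [np.full((T, D), float(i)) for i in range(3)]

# Upsample the CNN feature to the transformer rate, then aggregate
# everything with uniform initial weights (zeros before softmax).
aligned = [upsample_time(cnn_stage, 2)] + transformer_layers
fused = aggregate_features(aligned, np.zeros(len(aligned)))
print(fused.shape)  # (8, 4)
```

With uniform weights the fused output is simply the mean of the aligned layers; training would instead learn which CNN stages and transformer layers matter for extraction.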
Pages: 10421-10425
Page count: 5
Cited references
27 items in total
  • [1] Joint Encoder-Decoder Self-Supervised Pre-training for ASR
    Arunkumar, A.
    Umesh, S.
    [J]. INTERSPEECH 2022, 2022, : 3418 - 3422
  • [2] Baevski A, 2020, ADV NEUR IN, V33
  • [3] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
    Chen, Sanyuan
    Wang, Chengyi
    Chen, Zhengyang
    Wu, Yu
    Liu, Shujie
    Chen, Zhuo
    Li, Jinyu
    Kanda, Naoyuki
    Yoshioka, Takuya
    Xiao, Xiong
    Wu, Jian
    Zhou, Long
    Ren, Shuo
    Qian, Yanmin
    Qian, Yao
    Zeng, Michael
    Yu, Xiangzhan
    Wei, Furu
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1505 - 1518
  • [4] LARGE-SCALE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEAKER VERIFICATION
    Chen, Zhengyang
    Chen, Sanyuan
    Wu, Yu
    Qian, Yao
    Wang, Chengyi
    Liu, Shujie
    Qian, Yanmin
    Zeng, Michael
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6147 - 6151
  • [5] Cosentino J, 2020, Arxiv, DOI arXiv:2005.11262
  • [6] Listen only to me! How well can target speech extraction handle false alarms?
    Delcroix, Marc
    Kinoshita, Keisuke
    Ochiai, Tsubasa
    Zmolikova, Katerina
    Sato, Hiroshi
    Nakatani, Tomohiro
    [J]. INTERSPEECH 2022, 2022, : 216 - 220
  • [7] Delcroix M, 2020, INT CONF ACOUST SPEE, P691, DOI 10.1109/ICASSP40776.2020.9054683
  • [8] SpEx+: A Complete Time Domain Speaker Extraction Network
    Ge, Meng
    Xu, Chenglin
    Wang, Longbiao
    Chng, Eng Siong
    Dang, Jianwu
    Li, Haizhou
    [J]. INTERSPEECH 2020, 2020, : 1406 - 1410
  • [9] DPCCN: DENSELY-CONNECTED PYRAMID COMPLEX CONVOLUTIONAL NETWORK FOR ROBUST SPEECH SEPARATION AND EXTRACTION
    Han, Jiangyu
    Long, Yanhua
    Burget, Lukas
    Cernocky, Jan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7292 - 7296
  • [10] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
    Hsu, Wei-Ning
    Bolte, Benjamin
    Tsai, Yao-Hung Hubert
    Lakhotia, Kushal
    Salakhutdinov, Ruslan
    Mohamed, Abdelrahman
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3451 - 3460