Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion

Citations: 2
Authors
Li, Xiao [1 ,2 ]
Liu, Ruirui [3 ,4 ]
Huang, Huichou [5 ]
Wu, Qingyao [6 ,7 ]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510006, Peoples R China
[2] Minist Educ, Key Lab Big Data & Intelligent Robot, Beijing 100081, Peoples R China
[3] Brunel Univ London, Dept Econ & Finance, London UB8 3PH, England
[4] Kings Coll London, Data Analyt Finance & Macro Ctr, London, England
[5] City Univ Hong Kong, Global Res Unit, Hong Kong, Peoples R China
[6] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[7] Pazhou Lab, Guangzhou 510335, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Task analysis; Visualization; Speech enhancement; Decoding; Degradation; Transformers; Speaker extraction; self-supervised learning; contrastive learning; attention;
DOI
10.1109/TASLP.2023.3324550
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Given a reference speech clip from the target speaker, Target Speaker Extraction (TSE) is the challenging task of extracting the target speaker's signal from a multi-speaker mixture. TSE networks typically comprise a main network and an auxiliary network: the former uses the target speaker embedding to generate a mask that isolates the target speaker's signal from those of other speakers, while the latter learns deep discriminative embeddings from the target speaker's signal. However, TSE networks often suffer performance degradation on unseen speakers or short reference speeches. In this article, we propose a novel approach that leverages contrastive learning in the auxiliary network to obtain better representations of unseen speakers and short reference speeches. Specifically, we employ contrastive learning to bridge the gap between short and long speech features, so that the auxiliary network, given a short speech as input, generates feature embeddings as rich as those obtained from a long speech, thereby improving the recognition of unseen speakers and short speeches. Moreover, we introduce an attention-based fusion method that adaptively integrates the speaker embedding into the main network to enhance mask generation. Experimental results demonstrate the effectiveness of the proposed method in improving TSE performance in realistic open scenarios.
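The contrastive objective described in the abstract — pulling short-clip speaker embeddings toward the long-clip embedding of the same speaker while pushing away other speakers — can be sketched as a standard InfoNCE loss. This is an illustrative reconstruction, not the paper's implementation: the function names, temperature value, and batch layout (row i of both matrices belongs to the same speaker) are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(short_emb, long_emb, temperature=0.1):
    """InfoNCE contrastive loss between short- and long-clip speaker embeddings.

    short_emb, long_emb: (batch, dim) arrays; row i of each comes from the
    same speaker (the positive pair), all other rows act as negatives.
    """
    s = l2_normalize(short_emb)
    l = l2_normalize(long_emb)
    logits = (s @ l.T) / temperature                   # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positives on the diagonal
```

Minimizing this loss encourages the auxiliary network to map a short reference clip to (approximately) the same point in embedding space as a long clip from the same speaker, which is the stated goal of the method.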
Pages: 178-188
Page count: 11
Related Papers
50 records
  • [1] Wang Sijie; Hamdulla, Askar; Ablimit, Mijit. Target Speaker Extraction with Attention Enhancement and Gated Fusion Mechanism. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2023: 1995-2001.
  • [2] Chazan, Shlomo E.; Gannot, Sharon; Goldberger, Jacob. Attention-Based Neural Network for Joint Diarization and Speaker Extraction. 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 2018: 301-305.
  • [3] Remadna, Ikram; Terrissa, Labib Sadek; Al Masry, Zeina; Zerhouni, Noureddine. RUL Prediction Using a Fusion of Attention-Based Convolutional Variational AutoEncoder and Ensemble Learning Classifier. IEEE Transactions on Reliability, 2023, 72(1): 106-124.
  • [4] Goel, Chirag; Koppisetti, Surya; Colman, Ben; Shahriyari, Ali; Bharaj, Gaurav. Towards Attention-based Contrastive Learning for Audio Spoof Detection. Interspeech 2023, 2023: 2758-2762.
  • [5] Han, Jiangyu; Rao, Wei; Long, Yanhua; Liang, Jiaen. Attention-Based Scaling Adaptation for Target Speech Extraction. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021: 658-662.
  • [6] Li, Tingle; Lin, Qingjian; Bao, Yuanyuan; Li, Ming. Atss-Net: Target Speaker Separation via Attention-based Neural Network. Interspeech 2020, 2020: 1411-1415.
  • [7] Zhang, Hongjie; Qiang, Wenwen; Zhang, Jinxin; Chen, Yingyi; Jing, Ling. Unified feature extraction framework based on contrastive learning. Knowledge-Based Systems, 2022, 258.
  • [8] Xu, Yulong; Bi, Hanbo; Yu, Hongfeng; Lu, Wanxuan; Li, Peifeng; Li, Xinming; Sun, Xian. Attention-Based Contrastive Learning for Few-Shot Remote Sensing Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62.
  • [9] Meng, Hanyu; Zhang, Qiquan; Zhang, Xiangyu; Sethu, Vidhyasaharan; Ambikairajah, Eliathamby. Binaural Selective Attention Model for Target Speaker Extraction. Interspeech 2024, 2024: 4323-4327.
  • [10] Liu, Zhongyu; Chen, Tian; Ding, Enjie; Liu, Yafeng; Yu, Wanli. Attention-Based Convolutional LSTM for Describing Video. IEEE Access, 2020, 8: 133713-133724.