Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion

Cited by: 2
Authors
Li, Xiao [1 ,2 ]
Liu, Ruirui [3 ,4 ]
Huang, Huichou [5 ]
Wu, Qingyao [6 ,7 ]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510006, Peoples R China
[2] Minist Educ, Key Lab Big Data & Intelligent Robot, Beijing 100081, Peoples R China
[3] Brunel Univ London, Dept Econ & Finance, London UB8 3PH, England
[4] Kings Coll London, Data Analyt Finance & Macro Ctr, London, England
[5] City Univ Hong Kong, Global Res Unit, Hong Kong, Peoples R China
[6] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[7] Pazhou Lab, Guangzhou 510335, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Task analysis; Visualization; Speech enhancement; Decoding; Degradation; Transformers; Speaker extraction; self-supervised learning; contrastive learning; attention;
DOI
10.1109/TASLP.2023.3324550
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Given a reference speech clip from the target speaker, Target Speaker Extraction (TSE) is the challenging task of extracting the target speaker's signal from a multi-speaker mixture. TSE networks typically comprise a main network and an auxiliary network. The former uses the target speaker embedding to generate a mask that isolates the target speaker's signal from those of other speakers, while the latter learns deep discriminative embeddings from the target speaker's signal. However, TSE networks often suffer performance degradation on unseen speakers or short reference speeches. In this article, we propose a novel approach that leverages contrastive learning in the auxiliary network to obtain better representations of unseen speakers and short reference speeches. Specifically, we employ contrastive learning to bridge the gap between short and long speech features, so that the auxiliary network, given a short speech as input, generates feature embeddings as rich as those obtained from a long speech, thereby improving the recognition of unseen speakers and short speeches. Moreover, we introduce an attention-based fusion method that adaptively integrates the speaker embedding into the main network to enhance mask generation. Experimental results demonstrate the effectiveness of the proposed method in improving TSE performance in realistic open scenarios.
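The abstract does not give the exact loss used to bridge short and long speech features; a common formulation for this kind of objective is an InfoNCE-style contrastive loss, where the short-clip and long-clip embeddings of the same speaker form a positive pair and other speakers in the batch serve as negatives. The sketch below (names and the `temperature` value are illustrative assumptions, not taken from the paper) shows that formulation in NumPy:

```python
import numpy as np

def info_nce_loss(short_emb, long_emb, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative, not the paper's exact loss).

    short_emb, long_emb: (batch, dim) arrays. Row i of each is assumed to
    come from the same speaker (positive pair); all other rows in the
    batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    s = short_emb / np.linalg.norm(short_emb, axis=1, keepdims=True)
    l = long_emb / np.linalg.norm(long_emb, axis=1, keepdims=True)
    logits = s @ l.T / temperature  # (batch, batch) similarity matrix

    # Row-wise log-softmax; the diagonal entries are the positive pairs
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Minimizing this pulls each short-clip embedding toward its
    # speaker's long-clip embedding and away from other speakers'
    return -np.mean(np.diag(log_prob))
```

Under this objective, a well-trained auxiliary network maps a short reference clip near the embedding of a long clip from the same speaker, which is the effect the abstract describes.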
Pages: 178-188
Page count: 11
Related papers
50 records
  • [21] Fine-Grained Abandoned Cropland Mapping in Southern China Using Pixel Attention Contrastive Learning
    Li, Haoyang
    Lin, Haomei
    Luo, Junshen
    Wang, Teng
    Chen, Hao
    Xu, Qiuting
    Zhang, Xinchang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 2283 - 2295
  • [22] Attention-Based Multistage Fusion Network for Remote Sensing Image Pansharpening
    Zhang, Wanwan
    Li, Jinjiang
    Hua, Zhen
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [23] Virtual Fusion With Contrastive Learning for Single-Sensor-Based Activity Recognition
    Nguyen, Duc-Anh
    Pham, Cuong
    Le-Khac, Nhien-An
    IEEE SENSORS JOURNAL, 2024, 24 (15) : 25041 - 25048
  • [24] Multimodal Sentiment Analysis of Government Information Comments Based on Contrastive Learning and Cross-Attention Fusion Networks
    Mu, Guangyu
    Chen, Chuanzhi
    Li, Xiurong
    Li, Jiaxue
    Ju, Xiaoqing
    Dai, Jiaxiu
    IEEE ACCESS, 2024, 12 : 165525 - 165538
  • [25] Multi-Source Attention-Based Fusion for Segmentation of Natural Disasters
    El Rai, Marwa Chendeb
    Darweesh, Muna
    Far, Aicha Beya
    Gawanmeh, Amjad
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21
  • [26] Explainable Attention-Based AAV Target Detection for Search and Rescue Scenarios
    Liu, Shiyu
    Yi, Ling
    Xiong, Xuanrui
    Tolba, Amr
    Ding, Jinliang
    Li, Chun
    IEEE INTERNET OF THINGS JOURNAL, 2025, 12 (05): : 4922 - 4934
  • [27] Dual-Feature Attention-Based Contrastive Prototypical Clustering for Multimodal Remote Sensing Data
    Xu, Shufang
    Ding, Xinchen
    Zhang, Yiyan
    Zhang, Zhen
    Gao, Hongmin
    Zhang, Bing
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [28] Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech
    Xu, Chenglin
    Rao, Wei
    Wu, Jibin
    Li, Haizhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2696 - 2709
  • [29] Speaker Adaptation for Attention-Based End-to-End Speech Recognition
    Meng, Zhong
    Gaur, Yashesh
    Li, Jinyu
    Gong, Yifan
    INTERSPEECH 2019, 2019, : 241 - 245
  • [30] Attention-based multi-task learning for speech-enhancement and speaker-identification in multi-speaker dialogue scenario
    Peng, Chiang-Jen
    Chan, Yun-Ju
    Yu, Cheng
    Wang, Syu-Siang
    Tsao, Yu
    Chi, Tai-Shih
2021 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2021