Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion

Cited by: 2
Authors
Li, Xiao [1, 2]
Liu, Ruirui [3, 4]
Huang, Huichou [5]
Wu, Qingyao [6, 7]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510006, Peoples R China
[2] Minist Educ, Key Lab Big Data & Intelligent Robot, Beijing 100081, Peoples R China
[3] Brunel Univ London, Dept Econ & Finance, London UB8 3PH, England
[4] Kings Coll London, Data Analyt Finance & Macro Ctr, London, England
[5] City Univ Hong Kong, Global Res Unit, Hong Kong, Peoples R China
[6] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[7] Pazhou Lab, Guangzhou 510335, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Task analysis; Visualization; Speech enhancement; Decoding; Degradation; Transformers; Speaker extraction; self-supervised learning; contrastive learning; attention
DOI
10.1109/TASLP.2023.3324550
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Given a reference speech clip from the target speaker, Target Speaker Extraction (TSE) is the challenging task of extracting the signal of the target speaker from a multi-speaker environment. TSE networks typically comprise a main network and an auxiliary network. The former uses the obtained target speaker embedding to generate an appropriate mask for isolating the signal of the target speaker from those of other speakers, while the latter aims to learn deep discriminative embeddings from the signal of the target speaker. However, TSE networks often suffer performance degradation when dealing with unseen speakers or short reference speech. In this article, we propose a novel approach that leverages contrastive learning in the auxiliary network to obtain better representations of unseen speakers or short reference speech. Specifically, we employ contrastive learning to bridge the gap between short and long speech features. As a result, the auxiliary network, given a short speech input, generates feature embeddings that are as rich as those obtained from a long speech, thereby improving the recognition of unseen speakers or short speeches. Moreover, we introduce an attention-based fusion method that adaptively integrates the speaker embedding into the main network to enhance mask generation. Experimental results demonstrate the effectiveness of our proposed method in improving the performance of TSE tasks in realistic open scenarios.
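The idea of pulling short-reference embeddings toward long-reference embeddings of the same speaker can be illustrated with a generic InfoNCE-style objective. The following is only a minimal sketch under assumptions: the abstract does not state the exact loss, so the function name `info_nce`, the `temperature` value, and the choice of in-batch negatives are all hypothetical and are not taken from the article.

```python
import numpy as np

def info_nce(short_emb, long_emb, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the article's exact loss).

    Row i of short_emb and row i of long_emb are treated as a positive pair
    (same speaker); every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product equals cosine similarity.
    s = short_emb / np.linalg.norm(short_emb, axis=1, keepdims=True)
    l = long_emb / np.linalg.norm(long_emb, axis=1, keepdims=True)
    logits = s @ l.T / temperature                       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy data standing in for auxiliary-network outputs (hypothetical shapes).
rng = np.random.default_rng(0)
long_emb = rng.normal(size=(8, 16))
aligned_short = long_emb + 0.01 * rng.normal(size=(8, 16))
mismatched_short = np.roll(long_emb, 1, axis=0)  # pairs shifted by one speaker

print(info_nce(aligned_short, long_emb))     # small when pairs agree
print(info_nce(mismatched_short, long_emb))  # larger when pairs disagree
```

Minimising a loss of this kind pushes the embedding of a short clip to be most similar to the long-clip embedding of the same speaker, which is the "bridging" effect the abstract describes.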
Pages: 178-188
Page count: 11