Target Speaker Extraction with Attention Enhancement and Gated Fusion Mechanism

Cited by: 1
Authors
Wang, Sijie [1, 2]
Hamdulla, Askar [1, 2]
Ablimit, Mijit [1, 2]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Key Lab Signal Detect & Proc, Urumqi, Peoples R China
Source
2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC | 2023
Keywords
target speaker extraction; attention; gated fusion; multi-task learning; NETWORK;
DOI
10.1109/APSIPAASC58517.2023.10317106
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A target speaker extraction system aims to extract the target speaker's speech from a mixture of multiple speakers and noise, using a certain amount of auxiliary information about the target speaker. In this paper, we investigate improving a baseline system by incorporating the lightweight CBAM module into the target extractor and a gated fusion module (GFM) into the fusion layer. CBAM adds attention enhancement to the baseline model without significantly increasing the number of parameters or the complexity. GFM replaces the earlier concatenation-based fusion of the speaker embedding with the input mixture (or an intermediate output), enabling the model to better leverage the supplementary information that the speaker embedding provides. Experimental results on datasets built from WSJ0-2mix and WHAM! demonstrate that the CBAM module and the lightweight GFM module each improve model performance individually, with the GFM module showing better improvement on WHAM!. However, combining the two modules is mutually beneficial only on the clean dataset WSJ0-2mix; on the noisy dataset WHAM!, the combination performs worse than the GFM module alone.
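The contrast between concatenation-based fusion and gated fusion can be illustrated with a minimal sketch. The abstract does not specify the internal design of the paper's GFM, so the gate form below (a sigmoid gate that blends a mixture frame with a projected speaker embedding) is a generic assumption rather than the authors' module. All names, dimensions, and random weights (`gated_fusion`, `W_gate`, `b_gate`, `W_proj`, `D`, `E`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def concat_fusion(frame, spk_emb):
    """Concatenation-style fusion: append the speaker embedding to a frame."""
    return frame + spk_emb

def gated_fusion(frame, spk_emb, W_gate, b_gate, W_proj):
    """Generic gated fusion (assumed form, not the paper's GFM).

    A per-dimension gate in (0, 1) decides how much of the mixture frame
    versus the projected speaker embedding passes through.
    """
    gate_in = frame + spk_emb
    gate = [sigmoid(z + b) for z, b in zip(matvec(W_gate, gate_in), b_gate)]
    proj = matvec(W_proj, spk_emb)  # map the embedding to the frame dimension
    return [g * x + (1.0 - g) * p for g, x, p in zip(gate, frame, proj)]

# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
D, E = 4, 3  # frame dimension, speaker embedding dimension
frame = [rng.uniform(-1, 1) for _ in range(D)]
spk_emb = [rng.uniform(-1, 1) for _ in range(E)]
W_gate = [[rng.uniform(-0.5, 0.5) for _ in range(D + E)] for _ in range(D)]
b_gate = [0.0] * D
W_proj = [[rng.uniform(-0.5, 0.5) for _ in range(E)] for _ in range(D)]

fused_concat = concat_fusion(frame, spk_emb)
fused_gated = gated_fusion(frame, spk_emb, W_gate, b_gate, W_proj)
print(len(fused_concat), len(fused_gated))
```

In this sketch, concatenation grows the feature dimension to `D + E` and leaves it to later layers to decide how to use the embedding, whereas the gated variant keeps the dimension at `D` and weights the two sources explicitly per dimension.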
Pages: 1995-2001
Number of pages: 7