Target Speaker Extraction with Attention Enhancement and Gated Fusion Mechanism

Cited by: 1

Authors:

Wang, Sijie [1,2]

Hamdulla, Askar [1,2]

Ablimit, Mijit [1,2]

Affiliations:

[1] Xinjiang University, School of Information Science and Engineering, Urumqi, People's Republic of China

[2] Key Laboratory of Signal Detection and Processing, Urumqi, People's Republic of China

Source:

2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC | 2023

Keywords:

target speaker extraction; attention; gated fusion; multi-task learning; network

DOI:

10.1109/APSIPAASC58517.2023.10317106

CLC (Chinese Library Classification) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The objective of a target speaker extraction system is to extract the target speaker's speech from a mixture of multiple speakers and noise, using some auxiliary information about the target speaker. In this paper, we investigate improvements to a baseline system obtained by incorporating the lightweight convolutional block attention module (CBAM) into the target extractor and a gated fusion module (GFM) into the fusion layer. CBAM adds attention enhancement to the baseline model without a significant increase in parameter count or complexity. GFM replaces the previous concatenation-based fusion of the speaker embedding with the input mixture (or intermediate output), enabling the model to make better use of the supplementary information provided by the speaker embedding. Experimental results on datasets built from WSJ0-2mix and WHAM! show that the CBAM module and the lightweight GFM module each improve model performance individually, and that GFM shows a larger improvement on WHAM!. However, combining the two modules is mutually beneficial only on the clean WSJ0-2mix dataset; on the noisy WHAM! dataset, the combined model performs worse than the model using GFM alone.
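To make the two modules concrete, here is a minimal pure-Python sketch under stated assumptions. The abstract does not specify the internals of the paper's GFM or exactly how CBAM is placed in the extractor, so this is not the authors' implementation. `concat_fusion` illustrates the concatenation baseline; `gated_fusion` uses a common generic gated-fusion form (a sigmoid gate computed from the concatenated features and embedding that mixes the features with a projected embedding); `cbam_1d` adapts CBAM's channel-then-spatial attention to 1-D `[channels][frames]` features by treating the time axis as the "spatial" axis. All function names, weight shapes, and toy values are hypothetical. In a trained model the weights would be learned.

```python
import math
from typing import List

Matrix = List[List[float]]  # hypothetical layout: [channels][frames]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def matvec(w: Matrix, v: List[float]) -> List[float]:
    """Bias-free dense layer: w has shape [out_dim][in_dim]."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def concat_fusion(x: Matrix, e: List[float]) -> Matrix:
    """Baseline: repeat the speaker embedding over time and stack it under x."""
    frames = len(x[0])
    return [row[:] for row in x] + [[ei] * frames for ei in e]


def gated_fusion(x: Matrix, e: List[float], w_gate: Matrix, w_proj: Matrix) -> Matrix:
    """Generic gated fusion (assumed form, not the paper's exact GFM):
    out_t = g_t * x_t + (1 - g_t) * P e, with g_t = sigmoid(W_g [x_t; e])."""
    channels, frames = len(x), len(x[0])
    e_proj = matvec(w_proj, e)  # embedding projected to `channels` dims
    out = [[0.0] * frames for _ in range(channels)]
    for t in range(frames):
        x_t = [x[c][t] for c in range(channels)]
        gate = [sigmoid(z) for z in matvec(w_gate, x_t + e)]  # one gate per channel
        for c in range(channels):
            out[c][t] = gate[c] * x_t[c] + (1.0 - gate[c]) * e_proj[c]
    return out


def cbam_1d(x: Matrix, w1: Matrix, w2: Matrix, kernel: Matrix) -> Matrix:
    """CBAM adapted to [channels][frames]: channel attention, then temporal
    attention (the 1-D analogue of CBAM's spatial attention)."""
    channels, frames = len(x), len(x[0])

    # Channel attention: shared MLP (ReLU hidden layer) on avg- and max-pooled stats.
    def mlp(v: List[float]) -> List[float]:
        return matvec(w2, [max(0.0, h) for h in matvec(w1, v)])

    avg_c = [sum(row) / frames for row in x]
    max_c = [max(row) for row in x]
    ca = [sigmoid(a + b) for a, b in zip(mlp(avg_c), mlp(max_c))]
    x = [[ca[c] * v for v in x[c]] for c in range(channels)]

    # Temporal attention: 1-D "same" convolution over [avg; max] pooled across channels.
    avg_t = [sum(x[c][t] for c in range(channels)) / channels for t in range(frames)]
    max_t = [max(x[c][t] for c in range(channels)) for t in range(frames)]
    k = len(kernel[0])
    pad = k // 2
    sa = []
    for t in range(frames):
        z = 0.0
        for j in range(k):
            i = t + j - pad
            if 0 <= i < frames:  # zero padding at the edges
                z += kernel[0][j] * avg_t[i] + kernel[1][j] * max_t[i]
        sa.append(sigmoid(z))
    return [[x[c][t] * sa[t] for t in range(frames)] for c in range(channels)]


if __name__ == "__main__":
    x = [[1.0, 2.0, 0.5], [3.0, -1.0, 0.0]]  # toy features: 2 channels x 3 frames
    e = [0.2, -0.4, 0.1]  # toy 3-dim speaker embedding
    print(len(concat_fusion(x, e)))  # 5 rows: 2 feature + 3 embedding channels
    w_gate = [[0.1] * 5, [-0.1] * 5]
    w_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(gated_fusion(x, e, w_gate, w_proj))
    w1 = [[0.5, -0.5]]  # channel reduction to 1 hidden unit
    w2 = [[1.0], [-1.0]]
    kernel = [[0.2, 0.5, 0.2], [0.1, 0.3, 0.1]]
    print(cbam_1d(x, w1, w2, kernel))
```

The gate lets the model choose, per channel and frame, how much embedding information to inject, whereas concatenation simply appends the embedding and leaves that weighting to later layers. A practical system would express these as batched tensor operations in a deep learning framework rather than Python loops.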


Pages: 1995-2001

Number of pages: 7
