Speaker extraction network with attention mechanism for speech dialogue system

Cited by: 1

Authors:

Hao, Yun [1]

Wu, Jiaju [1]

Huang, Xiangkang [1]

Zhang, Zijia [1]

Liu, Fei [1]

Wu, Qingyao [1,2]

Affiliations:

[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China

[2] Pazhou Lab, Guangzhou, Peoples R China

Source:

SERVICE ORIENTED COMPUTING AND APPLICATIONS | 2022, Vol. 16, Issue 2

Funding:

National Natural Science Foundation of China;

Keywords:

Speech dialogue system; Speech separation; Multi-task; Attention; SEPARATION; ENHANCEMENT;

DOI:

10.1007/s11761-022-00340-w

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Speech dialogue systems are now widely used in many fields, letting users interact with the system through natural language. In real dialogue scenes, however, third-party background speech and background noise interfere with the target speaker. This interference seriously damages the intelligibility of the speech signal and degrades speech recognition performance. To tackle this problem, we use a speech separation method to separate the target speech from complex multi-speaker speech. We propose a multi-task attention mechanism and select TFCN as the audio feature extraction module. Following the multi-task method, we train jointly with an SI-SDR loss and a cross-entropy speaker classification loss, and then use the attention mechanism to further exclude background vocals from the mixed speech. We evaluate the results not only with the distortion metrics SI-SDR and SDR but also with a speech recognition system. To train the model and demonstrate its effectiveness, we build a background vocal removal dataset based on a common dataset. Experimental results show that our model significantly improves the performance of the speech separation model.
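
The joint objective named in the abstract (SI-SDR plus cross-entropy speaker classification) could be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract does not say how the two terms are combined, so the weighted sum and the `ce_weight` parameter are assumptions, as are all function names.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything not explained by the scaled target counts as distortion.
    noise = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps)
    return 10.0 * np.log10(ratio)


def cross_entropy(logits, speaker_id):
    """Cross-entropy of one speaker label given unnormalized class scores."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[speaker_id]


def joint_loss(estimate, target, logits, speaker_id, ce_weight=0.5):
    """Assumed combination: negative SI-SDR plus weighted cross-entropy."""
    return -si_sdr(estimate, target) + ce_weight * cross_entropy(logits, speaker_id)
```

Minimizing the negative SI-SDR pushes the extracted waveform toward the target speech, while the classification term encourages features that identify the target speaker; the relative weight would need tuning in practice.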

Pages: 111-119

Page count: 9

50 records in total

[31] Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation
Liu, Debang
Zhang, Tianqi
Christensen, Mads Graesboll
Yi, Chen
An, Zeliang
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 4647 - 4660
[32] Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding
Kim, Minsoo
Jang, Gil-Jin
APPLIED SCIENCES-BASEL, 2024, 14 (18):
[33] Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech
Xu, Chenglin
Rao, Wei
Wu, Jibin
Li, Haizhou
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2696 - 2709
[34] ATTENTION-BASED SCALING ADAPTATION FOR TARGET SPEECH EXTRACTION
Han, Jiangyu
Rao, Wei
Long, Yanhua
Liang, Jiaen
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 658 - 662
[35] CONTINUOUS SPEECH SEPARATION WITH RECURRENT SELECTIVE ATTENTION NETWORK
Zhang, Yixuan
Chen, Zhuo
Wu, Jian
Yoshioka, Takuya
Wang, Peidong
Meng, Zhong
Li, Jinyu
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6017 - 6021
[36] wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech
Borsdorf, Marvin
Pan, Zexu
Li, Haizhou
Schultz, Tanja
INTERSPEECH 2024, 2024, : 5038 - 5042
[37] MULTI-CHANNEL TARGET SPEECH EXTRACTION WITH CHANNEL DECORRELATION AND TARGET SPEAKER ADAPTATION
Han, Jiangyu
Zhou, Xinyuan
Long, Yanhua
Li, Yijie
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6094 - 6098
[38] CONTEXT-AWARE ATTENTION MECHANISM FOR SPEECH EMOTION RECOGNITION
Ramet, Gaetan
Garner, Philip N.
Baeriswyl, Michael
Lazaridis, Alexandros
2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 126 - 131
[39] DEEP AUDIO-VISUAL SPEECH SEPARATION WITH ATTENTION MECHANISM
Li, Chenda
Qian, Yanmin
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7314 - 7318
[40] Deep Refinement: capsule network with attention mechanism-based system for text classification
Deepak Kumar Jain
Rachna Jain
Yash Upadhyay
Abhishek Kathuria
Xiangyuan Lan
Neural Computing and Applications, 2020, 32 : 1839 - 1856
