Speaker extraction network with attention mechanism for speech dialogue system

Cited by: 1
Authors
Hao, Yun [1 ]
Wu, Jiaju [1 ]
Huang, Xiangkang [1 ]
Zhang, Zijia [1 ]
Liu, Fei [1 ]
Wu, Qingyao [1 ,2 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] Pazhou Lab, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech dialogue system; Speech separation; Multi-task; Attention; SEPARATION; ENHANCEMENT;
DOI
10.1007/s11761-022-00340-w
CLC number
TP39 [Computer Applications];
Discipline code
081203; 0835;
Abstract
Speech dialogue systems are now widely used in many fields, allowing users to interact with machines through natural language. In practical settings, however, real dialogue scenes contain third-person background speech and background noise. This interference severely degrades the intelligibility of the speech signal and reduces speech recognition performance. To tackle this problem, we adopt a speech separation method that separates the target speech from complex multi-speaker mixtures. We propose a multi-task attention mechanism and select TFCN as the audio feature extraction module. Within the multi-task framework, we jointly train the network with an SI-SDR loss and a cross-entropy speaker classification loss, and then apply the attention mechanism to further suppress background voices in the mixed speech. We evaluate the results not only with the distortion metrics SI-SDR and SDR but also with a speech recognition system. To train the model and demonstrate its effectiveness, we build a background-voice removal dataset based on a common public dataset. Experimental results show that our model significantly improves the performance of the speech separation model.
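The joint training objective described in the abstract (SI-SDR separation loss combined with a cross-entropy speaker classification loss) can be illustrated with a minimal sketch. This is an assumption-laden illustration in PyTorch, not the paper's implementation: the function names, the weighting factor alpha, and the assumption that the network outputs both an estimated waveform and speaker logits are all hypothetical.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR (higher SI-SDR -> lower loss).
    est, ref: (batch, samples) waveforms."""
    # Zero-mean both signals so the measure is scale-invariant.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (optimal scaling).
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * ref
    noise = est - target
    ratio = (target ** 2).sum(dim=-1) / ((noise ** 2).sum(dim=-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def joint_loss(est_wav, ref_wav, spk_logits, spk_labels, alpha=0.1):
    """Multi-task objective: separation quality plus speaker classification.
    alpha is a placeholder weight, not a value taken from the paper."""
    sep = si_sdr_loss(est_wav, ref_wav)
    cls = F.cross_entropy(spk_logits, spk_labels)
    return sep + alpha * cls
```

Under this reading, minimizing the combined loss pushes the separator toward high-fidelity target speech while the auxiliary speaker classification task encourages speaker-discriminative features that the attention module can exploit.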
Pages: 111-119
Number of pages: 9
Related Papers
50 records in total (items [21]-[30] shown)
  • [21] Speech Enhancement for Multimodal Speaker Diarization System
    Ahmad, Rehan
    Zubair, Syed
    Alquhayz, Hani
    IEEE ACCESS, 2020, 8 : 126671 - 126680
  • [22] SEF-Net: Speaker Embedding Free Target Speaker Extraction Network
    Zeng, Bang
    Suo, Hongbin
    Wan, Yulong
    Li, Ming
    INTERSPEECH 2023, 2023, : 3452 - 3456
  • [23] Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications
    Zhang, Yiru
    Liu, Bijing
    Yang, Yong
    Yang, Qun
    ELECTRONICS, 2024, 13 (11)
  • [24] Contrastive Learning for Target Speaker Extraction With Attention-Based Fusion
    Li, Xiao
    Liu, Ruirui
    Huang, Huichou
    Wu, Qingyao
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 178 - 188
  • [25] A Pitch-aware Speaker Extraction Serial Network
    Jiang, Yu
    Ge, Meng
    Wang, Longbiao
    Dang, Jianwu
    Honda, Kiyoshi
    Zhang, Sulin
    Yu, Bo
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 616 - 620
  • [26] Speaker Adaptation for Attention-Based End-to-End Speech Recognition
    Meng, Zhong
    Gaur, Yashesh
    Li, Jinyu
    Gong, Yifan
    INTERSPEECH 2019, 2019, : 241 - 245
  • [27] SPEAKER REINFORCEMENT USING TARGET SOURCE EXTRACTION FOR ROBUST AUTOMATIC SPEECH RECOGNITION
    Zorila, Catalin
    Doddipatla, Rama
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6297 - 6301
  • [28] Speaker-independent auditory attention decoding without access to clean speech sources
    Han, Cong
    O'Sullivan, James
    Luo, Yi
    Herrero, Jose
    Mehta, Ashesh D.
    Mesgarani, Nima
    SCIENCE ADVANCES, 2019, 5 (05)
  • [29] Speaker Attractor Network: Generalizing Speech Separation to Unseen Numbers of Sources
    Jiang, Fei
    Duan, Zhiyao
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 1859 - 1863
  • [30] IMPROVING SPEAKER DISCRIMINATION OF TARGET SPEECH EXTRACTION WITH TIME-DOMAIN SPEAKERBEAM
    Delcroix, Marc
    Ochiai, Tsubasa
    Zmolikova, Katerina
    Kinoshita, Keisuke
    Tawara, Naohiro
    Nakatani, Tomohiro
    Araki, Shoko
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 691 - 695