Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding

Cited by: 0
Authors
Kim, Minsoo [1 ]
Jang, Gil-Jin [1 ]
Institution
[1] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 18
Keywords
speech recognition; speaker embedding; speaker-attributed training; separation
DOI
10.3390/app14188138
Abstract
Featured Application: speech recognition; speaker adaptation; speaker diarization.

Abstract: Automatic speech recognition (ASR) aims to convert naturally spoken human speech into text that machines can accept as input. In multi-speaker environments, where multiple speakers talk simultaneously with a large amount of overlap, conventional ASR systems trained on single-talker recordings may suffer significant performance degradation. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. The embedding for each speaker in the training set is extracted as a numeric vector, and all of the embedding vectors are stacked to construct a speaker profile matrix. At test time, the speaker profile matrix makes it possible to find embedding vectors close to those of the speakers in the input recording, which helps recognize the individual voices mixed in the input. Furthermore, the proposed method efficiently reuses the decoder of an existing speaker-independent ASR model, eliminating the need to retrain the entire system. Several speaker embedding methods, including the i-vector, d-vector, and x-vector, were adopted, and the experimental results show absolute word error rate (WER) improvements of 0.33% and 0.95% (3.9% and 11.5% relative) without and with the speaker profile, respectively.
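The speaker-profile lookup described in the abstract can be sketched as a small similarity search. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_speaker_profile` and `closest_speakers` are hypothetical, and cosine similarity over length-normalized embeddings is one common way to compare i-/d-/x-vectors.

```python
import numpy as np

def build_speaker_profile(embeddings):
    # Stack per-speaker embedding vectors (i-vector, d-vector, x-vector, ...)
    # into one profile matrix, length-normalizing each row so that a dot
    # product with a normalized query equals cosine similarity.
    return np.stack([e / np.linalg.norm(e) for e in embeddings])

def closest_speakers(profile, query, k=2):
    # Indices of the k profile rows most similar (by cosine) to the query.
    q = query / np.linalg.norm(query)
    sims = profile @ q                  # cosine similarities, shape (num_speakers,)
    return np.argsort(sims)[::-1][:k]   # highest similarity first

# Toy profile: 4 training speakers with 16-dim embeddings
# (real systems use larger embeddings, e.g. 512-dim x-vectors).
rng = np.random.default_rng(0)
profile = build_speaker_profile([rng.normal(size=16) for _ in range(4)])

# A test-time embedding that is a noisy copy of speaker 2's entry,
# so speaker 2 should rank first in the lookup.
query = profile[2] + 0.05 * rng.normal(size=16)
print(closest_speakers(profile, query, k=1))
```

In the paper's setting, the rows selected this way would supply the speaker embedding fed to the ASR model as the additional input, rather than being an end in themselves.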
Pages: 17