Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding

Times cited: 0
Authors
Kim, Minsoo [1]
Jang, Gil-Jin [1]
Affiliations
[1] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 18
Keywords
speech recognition; speaker embedding; speaker-attributed training; SEPARATION;
DOI
10.3390/app14188138
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Featured Application: Speech recognition; speaker adaptation; speaker diarization.
Abstract: Automatic speech recognition (ASR) aims to understand naturally spoken human speech so that it can be used as text input to machines. In multi-speaker environments, where several speakers talk simultaneously with substantial overlap, conventional ASR systems trained on single-talker recordings may degrade significantly. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. An embedding vector is extracted for each speaker in the training set, and all of the embedding vectors are stacked to construct a total speaker profile matrix. At test time, the speaker profile matrix makes it possible to find embedding vectors that are close to the speakers in the input recordings, which helps to recognize the individual speakers' voices mixed in the input. Furthermore, the proposed method reuses the decoder from an existing speaker-independent ASR model, eliminating the need to retrain the entire system. Several speaker embedding methods, including i-vector, d-vector, and x-vector, were adopted. The experimental results show absolute word error rate (WER) improvements of 0.33% and 0.95% (3.9% and 11.5% relative) without and with the speaker profile, respectively.
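The abstract describes stacking per-speaker embeddings into a profile matrix and finding profile entries close to the speakers in a test recording. It does not state the similarity measure or the attention formulation, so the following is only a minimal sketch under assumed choices: cosine similarity as the closeness score and a temperature-scaled softmax as the attention weights. The function names, the `temperature` parameter, and the toy 3-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention_weighted_embedding(query, profile, temperature=0.1):
    """Return (weights, mixed) for a query embedding against a profile matrix.

    profile: list of rows, one embedding per training speaker.
    weights: softmax over scaled cosine similarities (assumed scoring rule).
    mixed:   weighted sum of the profile rows.
    """
    scores = [cosine(query, row) / temperature for row in profile]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(profile[0])
    mixed = [sum(w * row[d] for w, row in zip(weights, profile))
             for d in range(dim)]
    return weights, mixed


# Toy profile matrix: three training speakers with 3-dimensional embeddings.
profile = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical embedding estimated from a test recording.
query = [0.9, 0.1, 0.0]

weights, mixed = attention_weighted_embedding(query, profile)
closest = weights.index(max(weights))
print("attention weights:", weights)
print("closest profile row:", closest)
```

In this toy case the query lies nearest to the first profile row, so that row receives the largest weight. A lower `temperature` makes the selection sharper, and a higher one spreads the weight over more training speakers.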
Pages: 17
Related papers
50 in total
  • [1] Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries
    Stafylakis, Themos
    Mosner, Ladislav
    Plchot, Oldrich
    Rohdin, Johan
    Silnova, Anna
    Burget, Lukas
    Cernocky, Jan Honza
    INTERSPEECH 2022, 2022, : 605 - 609
  • [2] Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data
    Huang, Wen-Chin
    Wu, Yi-Chiao
    Toda, Tomoki
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2995 - 2999
  • [3] SPEAKER CONDITIONING OF ACOUSTIC MODELS USING AFFINE TRANSFORMATION FOR MULTI-SPEAKER SPEECH RECOGNITION
    Yousefi, Midia
    Hansen, John H. L.
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 283 - 288
  • [4] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
    Das, Rohan Kumar
    Yang, Jichen
    Li, Haizhou
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1630 - 1635
  • [5] A hybrid approach to speaker recognition in multi-speaker environment
    Trivedi, J
    Maitra, A
    Mitra, SK
    PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2005, 3776 : 272 - 275
  • [6] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zheng, Yibin
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
  • [7] A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
    Yu, Fan
    Du, Zhihao
    Zhang, Shiliang
    Lin, Yuxiao
    Xie, Lei
    INTERSPEECH 2022, 2022, : 560 - 564
  • [8] DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding
    Lee, Junmo
    Song, Kwangsub
    Noh, Kyoungjin
    Park, Tae-Jun
    Chang, Joon-Hyuk
    2019 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2019, : 61 - 64
  • [9] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
    Denisov, Pavel
    Ngoc Thang Vu
    INTERSPEECH 2019, 2019, : 4425 - 4429
  • [10] SPEAKER RECOGNITION FOR MULTI-SPEAKER CONVERSATIONS USING X-VECTORS
    Snyder, David
    Garcia-Romero, Daniel
    Sell, Gregory
    McCree, Alan
    Povey, Daniel
    Khudanpur, Sanjeev
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5796 - 5800