Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding

Cited by: 0
Authors
Kim, Minsoo [1]
Jang, Gil-Jin [1]
Affiliations
[1] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 18
Keywords
speech recognition; speaker embedding; speaker-attributed training; SEPARATION;
DOI
10.3390/app14188138
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Featured Application: Speech recognition; speaker adaptation; speaker diarization.
Automatic speech recognition (ASR) aims to convert naturally spoken human speech into text that machines can use as input. In multi-speaker environments, where multiple speakers talk simultaneously with a large amount of overlap, conventional ASR systems may degrade significantly if they were trained on recordings of single talkers. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. The embedding information for each speaker in the training set was extracted as a numeric vector, and all of the embedding vectors were stacked to construct a total speaker profile matrix. The speaker profile matrix from the training dataset makes it possible to find embedding vectors that are close to the speakers of the input recordings in the test conditions, which helps to recognize the individual speakers' voices mixed in the input. Furthermore, the proposed method efficiently reuses the decoder from the existing speaker-independent ASR model, eliminating the need to retrain the entire system. Various speaker embedding methods, such as i-vector, d-vector, and x-vector, were adopted, and the experimental results show absolute word error rate (WER) improvements of 0.33% and 0.95% (3.9% and 11.5% relative) without and with the speaker profile, respectively.
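The speaker profile lookup described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the scoring function and a softmax with a temperature to turn scores into attention weights; the abstract does not state which scoring the authors use, so these choices, the function names, and the toy 3-dimensional vectors (standing in for i-vector, d-vector, or x-vector embeddings) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_embedding(query, profile, temperature=0.1):
    """Weight each profile row by its similarity to the query embedding.

    profile: list of speaker embeddings stacked row by row (assumed layout).
    temperature: assumed sharpening factor, not taken from the paper.
    Returns the attention weights and the weighted sum of profile rows.
    """
    scores = [cosine(query, row) / temperature for row in profile]
    weights = softmax(scores)
    dim = len(profile[0])
    mixed = [
        sum(w * row[d] for w, row in zip(weights, profile))
        for d in range(dim)
    ]
    return weights, mixed


# Toy profile matrix: one row per training speaker (illustrative values only).
profile = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical embedding estimated from a test recording.
query = [0.9, 0.1, 0.0]

weights, mixed = attention_weighted_embedding(query, profile)
best = max(range(len(weights)), key=weights.__getitem__)
print("closest training speaker index:", best)
print("attention weights:", [round(w, 3) for w in weights])
```

In this sketch the weighted sum `mixed` plays the role of the attention-weighted speaker embedding that would be passed to the recognizer as an additional input; a lower temperature makes the result closer to the single nearest profile row.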
Pages: 17
Related papers
50 in total
  • [31] AN INVESTIGATION OF MULTI-SPEAKER TRAINING FOR WAVENET VOCODER
    Hayashi, Tomoki
    Tamamori, Akira
    Kobayashi, Kazuhiro
    Takeda, Kazuya
    Toda, Tomoki
    2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 712 - 718
  • [32] Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion
    Aloradi, Ahmad
    Mack, Wolfgang
    Elminshawi, Mohamed
    Habets, Emanuël A. P.
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 354 - 358
  • [33] A Purely End-to-end System for Multi-speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 2620 - 2630
  • [34] Multi-Speaker Adaptation for Robust Speech Recognition under Ubiquitous Environment
    Shih, Po-Yi
    Wang, Jhing-Fa
    Lin, Yuan-Ning
    Fu, Zhong-Hua
    ORIENTAL COCOSDA 2009 - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2009, : 126 - 131
  • [35] Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
    Fujita, Kenichi
    Ando, Atsushi
    Ijima, Yusuke
    INTERSPEECH 2021, 2021, : 3141 - 3145
  • [36] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
    Mitsui, Kentaro
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    SPEECH COMMUNICATION, 2021, 132 : 132 - 145
  • [37] TOWARDS MULTI-SPEAKER UNSUPERVISED SPEECH PATTERN DISCOVERY
    Zhang, Yaodong
    Glass, James R.
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4366 - 4369
  • [38] Attention-based multi-task learning for speech-enhancement and speaker-identification in multi-speaker dialogue scenario
    Peng, Chiang-Jen
    Chan, Yun-Ju
    Yu, Cheng
    Wang, Syu-Siang
    Tsao, Yu
    Chi, Tai-Shih
    2021 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2021,
  • [39] MULTI-SPEAKER, NARROWBAND, CONTINUOUS MARATHI SPEECH DATABASE
    Godambe, Tejas
    Bondale, Nandini
    Samudravijaya, K.
    Rao, Preeti
    2013 INTERNATIONAL CONFERENCE ORIENTAL COCOSDA HELD JOINTLY WITH 2013 CONFERENCE ON ASIAN SPOKEN LANGUAGE RESEARCH AND EVALUATION (O-COCOSDA/CASLRE), 2013,
  • [40] INVESTIGATING ON INCORPORATING PRETRAINED AND LEARNABLE SPEAKER REPRESENTATIONS FOR MULTI-SPEAKER MULTI-STYLE TEXT-TO-SPEECH
    Chien, Chung-Ming
    Lin, Jheng-Hao
    Huang, Chien-yu
    Hsu, Po-chun
    Lee, Hung-yi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8588 - 8592