End-to-End Speaker-Attributed ASR with Transformer

被引:11
|
作者
Kanda, Naoyuki [1 ]
Ye, Guoli [1 ]
Gaur, Yashesh [1 ]
Wang, Xiaofei [1 ]
Meng, Zhong [1 ]
Chen, Zhuo [1 ]
Yoshioka, Takuya [1 ]
机构
[1] Microsoft Corp, Redmond, WA 98052 USA
来源
INTERSPEECH 2021 | 2021年
关键词
multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; SPEECH RECOGNITION; DIARIZATION;
D O I
10.21437/Interspeech.2021-101
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture that was previously designed based on a long short-term memory (LSTM)-based attention encoder decoder by applying transformer architectures. Secondly, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset shows that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speakerattributed word error rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are the state-of-the-art results for LibriCSS with the monaural setting.
引用
收藏
页码:4413 / 4417
页数:5
相关论文
共 50 条
  • [31] End-to-end recurrent denoising autoencoder embeddings for speaker identification
    Rituerto-Gonzalez, Esther
    Pelaez-Moreno, Carmen
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (21) : 14429 - 14439
  • [32] Adapting Transformer to End-to-end Spoken Language Translation
    Di Gangi, Mattia A.
    Negri, Matteo
    Turchi, Marco
    INTERSPEECH 2019, 2019, : 1133 - 1137
  • [33] META-LEARNING FOR IMPROVING RARE WORD RECOGNITION IN END-TO-END ASR
    Lux, Florian
    Ngoc Thang Vu
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5974 - 5978
  • [34] ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture
    Cheng, Gaofeng
    Miao, Haoran
    Yang, Runyan
    Deng, Keqi
    Yan, Yonghong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1360 - 1373
  • [35] LAYER-NORMALIZED LSTM FOR HYBRID-HMM AND END-TO-END ASR
    Zeineldeen, Mohammad
    Zeyer, Albert
    Schlueter, Ralf
    Ney, Hermann
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7679 - 7683
  • [36] IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR
    Higuchi, Yosuke
    Inaguma, Hirofumi
    Watanabe, Shinji
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8363 - 8367
  • [37] NAM plus : TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR
    Munkhdalai, Tsendsuren
    Wu, Zelin
    Pundak, Golan
    Sim, Khe Chai
    Li, Jiayang
    Rondon, Pat
    Sainath, Tara N.
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 190 - 196
  • [38] Speaker Adaptation for Attention-Based End-to-End Speech Recognition
    Meng, Zhong
    Gaur, Yashesh
    Li, Jinyu
    Gong, Yifan
    INTERSPEECH 2019, 2019, : 241 - 245
  • [39] DIVE: END-TO-END SPEECH DIARIZATION VIA ITERATIVE SPEAKER EMBEDDING
    Zeghidour, Neil
    Teboul, Olivier
    Grangier, David
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 702 - 709
  • [40] END-TO-END MULTI-CHANNEL TRANSFORMER FOR SPEECH RECOGNITION
    Chang, Feng-Ju
    Radfar, Martin
    Mouchtaris, Athanasios
    King, Brian
    Kunzmann, Siegfried
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5884 - 5888