End-to-End Speaker-Attributed ASR with Transformer

Cited by: 11
Authors
Kanda, Naoyuki [1]
Ye, Guoli [1]
Gaur, Yashesh [1]
Wang, Xiaofei [1]
Meng, Zhong [1]
Chen, Zhuo [1]
Yoshioka, Takuya [1]
Affiliation
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
INTERSPEECH 2021 | 2021
Keywords
multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; SPEECH RECOGNITION; DIARIZATION
DOI
10.21437/Interspeech.2021-101
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition, and speaker identification for monaural multi-talker audio. First, we thoroughly update the model architecture, which was previously based on a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures. Second, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are state-of-the-art results for LibriCSS in the monaural setting.
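The concatenated minimum-permutation word error rate mentioned in the abstract can be illustrated with a small sketch. This is not the authors' scoring tool. It assumes the commonly used definition of the metric: concatenate each speaker's utterances, try every assignment of hypothesis speakers to reference speakers, and report the lowest total word error count divided by the number of reference words. The function names `edit_distance` and `cp_wer` and the list-of-lists input layout are hypothetical choices made for this example.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Sketch of a concatenated minimum-permutation WER (assumed definition).

    Each argument is a list with one entry per speaker, and each entry is
    a list of that speaker's utterance strings.
    """
    # Concatenate every speaker's utterances into one token sequence.
    refs = [" ".join(utts).split() for utts in ref_by_speaker]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker]

    # Pad the shorter side with empty speakers so the counts match.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Brute-force search over speaker permutations; fine for a few speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

Because the search is factorial in the number of speakers, a practical scorer would likely replace the brute-force loop with an assignment solver; the brute-force form is kept here only because it makes the definition easy to read.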
Pages: 4413-4417
Page count: 5