End-to-End Speaker-Attributed ASR with Transformer

Cited by: 11
Authors
Kanda, Naoyuki [1 ]
Ye, Guoli [1 ]
Gaur, Yashesh [1 ]
Wang, Xiaofei [1 ]
Meng, Zhong [1 ]
Chen, Zhuo [1 ]
Yoshioka, Takuya [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
INTERSPEECH 2021 | 2021
Keywords
multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; speech recognition; diarization
DOI
10.21437/Interspeech.2021-101
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification
100104; 100213
Abstract
This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition (SA-ASR), which jointly performs speaker counting, speech recognition, and speaker identification for monaural multi-talker audio. First, we thoroughly update the model architecture, previously designed around a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures. Second, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, on the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are state-of-the-art results for LibriCSS in the monaural setting.
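The abstract leans on serialized output training (SOT), which is easiest to see with a small example. The following is a minimal sketch, not the authors' implementation: the Utterance type and build_sot_reference helper are hypothetical names introduced here for illustration, while the start-time ordering and the <sc> speaker-change symbol follow the published SOT convention.

# A minimal sketch (illustrative, not the authors' code) of how serialized
# output training flattens overlapping utterances into a single reference
# token stream. Utterance and build_sot_reference are hypothetical names.
from dataclasses import dataclass
from typing import List

SC, EOS = "<sc>", "<eos>"  # speaker-change and end-of-sequence symbols

@dataclass
class Utterance:
    start_time: float   # onset in seconds; SOT orders utterances first-in-first-out
    tokens: List[str]   # word/subword tokens of one speaker's utterance
    speaker_id: str     # used by the speaker-attribution branch, not by SOT itself

def build_sot_reference(utterances: List[Utterance]) -> List[str]:
    """Sort utterances by start time and join them with <sc> symbols,
    so the number of speakers equals the number of <sc> symbols plus one."""
    ordered = sorted(utterances, key=lambda u: u.start_time)
    reference: List[str] = []
    for i, utt in enumerate(ordered):
        if i > 0:
            reference.append(SC)  # mark a change of speaker
        reference.extend(utt.tokens)
    reference.append(EOS)
    return reference

# Example: a two-speaker LibriSpeechMix-style mixture.
mix = [
    Utterance(0.0, ["hello", "world"], "spk1"),
    Utterance(1.2, ["good", "morning"], "spk2"),
]
print(build_sot_reference(mix))
# -> ['hello', 'world', '<sc>', 'good', 'morning', '<eos>']

Because the decoder emits one <sc> per additional speaker, counting those symbols at inference time yields the speaker count as a by-product of transcription, which is the joint behavior the abstract attributes to the model.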
Pages: 4413-4417
Page count: 5
Related papers
50 items in total
  • [41] Exploring End-to-End Multi-Channel ASR with Bias Information for Meeting Transcription. Wang, Xiaofei; Kanda, Naoyuki; Gaur, Yashesh; Chen, Zhuo; Meng, Zhong; Yoshioka, Takuya. 2021 IEEE Spoken Language Technology Workshop (SLT), 2021: 833-840.
  • [42] Dysarthric Speech Augmentation Using Prosodic Transformation and Masking for Subword End-to-End ASR. Soleymanpour, Mohammad; Johnson, Michael T.; Berry, Jeffrey. 2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2021: 42-46.
  • [43] A High-Performance Neural Network SoC for End-to-End Speaker Verification. Tsai, Tsung-Han; Chiang, Meng-Jui. IEEE Access, 2024, 12: 165482-165496.
  • [44] PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR. Shao, Yiwen; Wang, Yiming; Povey, Daniel; Khudanpur, Sanjeev. Interspeech 2020, 2020: 561-565.
  • [45] An End-to-End Text-Independent Speaker Identification System on Short Utterances. Ji, Ruifang; Cai, Xinyuan; Xu, Bo. Interspeech 2018 (19th Annual Conference of the International Speech Communication Association), 2018: 3628-3632.
  • [46] Timestamp-Aligning and Keyword-Biasing End-to-End ASR Front-End for a KWS System. Shi, Gui-Xin; Zhang, Wei-Qiang; Wang, Guan-Bo; Zhao, Jing; Chai, Shu-Zhou; Zhao, Ze-Yu. EURASIP Journal on Audio, Speech, and Music Processing, 2021, 2021(1).
  • [47] Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction. Qiu, David; He, Yanzhang; Li, Qiujia; Zhang, Yu; Gao, Liangliang; McGraw, Ian. Interspeech 2021, 2021: 4074-4078.
  • [48] Improving Low-Resource Tibetan End-to-End ASR by Multilingual and Multilevel Unit Modeling. Qin, Siqing; Wang, Longbiao; Li, Sheng; Dang, Jianwu; Pan, Lixin. EURASIP Journal on Audio, Speech, and Music Processing, 2022, 2022(1).
  • [49] ConMamba: A Convolution-Augmented Mamba Encoder Model for Efficient End-to-End ASR Systems. Hou, Haoxiang; Gong, Xun; Qian, Yanmin. 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP 2024), 2024: 711-715.
  • [50] A Cluster-Based Personalized Federated Learning Strategy for End-to-End ASR of Dementia Patients. Hsu, Wei-Tung; Chen, Chin-Po; Lin, Yun-Shao; Lee, Chi-Chun. Interspeech 2024, 2024: 2450-2454.