HYPOTHESIS STITCHER FOR END-TO-END SPEAKER-ATTRIBUTED ASR ON LONG-FORM MULTI-TALKER RECORDINGS

Cited by: 3
Authors
Chang, Xuankai [1 ]
Kanda, Naoyuki [2 ]
Gaur, Yashesh [2 ]
Wang, Xiaofei [2 ]
Meng, Zhong [2 ]
Yoshioka, Takuya [2 ]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Hypothesis stitcher; speech recognition; speaker identification; rich transcription
DOI
10.1109/ICASSP39728.2021.9414432
Chinese Library Classification (CLC): O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was recently proposed to jointly perform speaker counting, speech recognition, and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) on monaural overlapped speech with an unknown number of speakers. However, the E2E modeling approach is susceptible to mismatches between training and testing conditions, and it has yet to be investigated whether the E2E SA-ASR model works well for recordings much longer than the samples seen during training. In this work, we first apply a known decoding technique, originally developed for single-speaker ASR on long-form audio, to our E2E SA-ASR task. We then propose a novel sequence-to-sequence model, called the hypothesis stitcher, which takes multiple hypotheses obtained from short segments extracted from the original long-form input and outputs a single fused hypothesis. We propose several architectural variations of the hypothesis stitcher and compare them with conventional decoding methods. Experiments on the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
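The abstract describes a segment-decode-stitch pipeline: the long-form recording is split into short windows, each window is decoded by the E2E SA-ASR model, and a sequence-to-sequence stitcher fuses the per-segment hypotheses into one output. The sketch below illustrates that data flow only; the function names, window/hop lengths, boundary token, and the trivial concatenation stand-in for the stitcher are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a segment-decode-stitch pipeline, assuming illustrative
# helpers (segment_audio, sa_asr_decode, stitch_hypotheses) that are NOT the
# authors' API. The real stitcher is a trained sequence-to-sequence model.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hypothesis:
    """One per-segment hypothesis: a token sequence with a speaker label per token."""
    tokens: List[str]
    speakers: List[str]


def segment_audio(num_samples: int, sr: int = 16000,
                  window_s: float = 30.0, hop_s: float = 15.0) -> List[Tuple[int, int]]:
    """Split a long-form recording into overlapping windows (sample ranges).
    Window and hop lengths are placeholder values."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    return [(s, min(s + win, num_samples)) for s in range(0, num_samples, hop)]


def sa_asr_decode(segment: Tuple[int, int]) -> Hypothesis:
    """Placeholder for the E2E SA-ASR decoder applied to one short segment.
    A real system returns recognized words with speaker attributions."""
    return Hypothesis(tokens=[f"word_{segment[0]}"], speakers=["spk1"])


def stitch_hypotheses(hyps: List[Hypothesis]) -> Hypothesis:
    """Placeholder for the hypothesis stitcher: it consumes the per-segment
    hypotheses (with segment boundaries marked) and emits one fused hypothesis.
    Here we simply concatenate with a boundary token as a stand-in."""
    tokens, speakers = [], []
    for h in hyps:
        tokens += h.tokens + ["<sep>"]      # boundary marker between segments
        speakers += h.speakers + ["<sep>"]
    return Hypothesis(tokens=tokens, speakers=speakers)


if __name__ == "__main__":
    segments = segment_audio(num_samples=16000 * 120)   # a 2-minute recording
    per_segment = [sa_asr_decode(seg) for seg in segments]
    fused = stitch_hypotheses(per_segment)
    print(len(segments), "segments ->", len(fused.tokens), "fused tokens")
```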
Pages: 6763 - 6767
Page count: 5