HYPOTHESIS STITCHER FOR END-TO-END SPEAKER-ATTRIBUTED ASR ON LONG-FORM MULTI-TALKER RECORDINGS

Cited by: 3
Authors
Chang, Xuankai [1 ]
Kanda, Naoyuki [2 ]
Gaur, Yashesh [2 ]
Wang, Xiaofei [2 ]
Meng, Zhong [2 ]
Yoshioka, Takuya [2 ]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Hypothesis stitcher; speech recognition; speaker identification; rich transcription
DOI
10.1109/ICASSP39728.2021.9414432
Chinese Library Classification (CLC): O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was recently proposed to jointly perform speaker counting, speech recognition, and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) on monaural overlapped speech with an unknown number of speakers. However, the E2E modeling approach is susceptible to mismatches between training and testing conditions, and it has yet to be investigated whether the E2E SA-ASR model works well for recordings much longer than the samples seen during training. In this work, we first apply a known decoding technique, originally developed for single-speaker ASR on long-form audio, to our E2E SA-ASR task. We then propose a novel sequence-to-sequence model, called the hypothesis stitcher, which takes multiple hypotheses obtained from short segments extracted from the original long-form input and outputs a single fused hypothesis. We propose several architectural variations of the hypothesis stitcher and compare them with conventional decoding methods. Experiments on the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
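The abstract describes a segment-decode-stitch pipeline: the long-form recording is split into short windows, each window is decoded by the E2E SA-ASR model, and a sequence-to-sequence stitcher fuses the per-segment hypotheses into one output. The sketch below illustrates that data flow only; the function names, window/hop lengths, boundary token, and the trivial concatenation stand-in for the stitcher are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a segment-decode-stitch pipeline, assuming illustrative
# helpers (segment_audio, sa_asr_decode, stitch_hypotheses) that are NOT the
# authors' API. The real stitcher is a trained sequence-to-sequence model.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hypothesis:
    """One per-segment hypothesis: a token sequence with a speaker label per token."""
    tokens: List[str]
    speakers: List[str]


def segment_audio(num_samples: int, sr: int = 16000,
                  window_s: float = 30.0, hop_s: float = 15.0) -> List[Tuple[int, int]]:
    """Split a long-form recording into overlapping windows (sample ranges).
    Window and hop lengths are placeholder values."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    return [(s, min(s + win, num_samples)) for s in range(0, num_samples, hop)]


def sa_asr_decode(segment: Tuple[int, int]) -> Hypothesis:
    """Placeholder for the E2E SA-ASR decoder applied to one short segment.
    A real system returns recognized words with speaker attributions."""
    return Hypothesis(tokens=[f"word_{segment[0]}"], speakers=["spk1"])


def stitch_hypotheses(hyps: List[Hypothesis]) -> Hypothesis:
    """Placeholder for the hypothesis stitcher: it consumes the per-segment
    hypotheses (with segment boundaries marked) and emits one fused hypothesis.
    Here we simply concatenate with a boundary token as a stand-in."""
    tokens, speakers = [], []
    for h in hyps:
        tokens += h.tokens + ["<sep>"]      # boundary marker between segments
        speakers += h.speakers + ["<sep>"]
    return Hypothesis(tokens=tokens, speakers=speakers)


if __name__ == "__main__":
    segments = segment_audio(num_samples=16000 * 120)   # a 2-minute recording
    per_segment = [sa_asr_decode(seg) for seg in segments]
    fused = stitch_hypotheses(per_segment)
    print(len(segments), "segments ->", len(fused.tokens), "fused tokens")
```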
Pages: 6763 - 6767
Page count: 5