Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Cited by: 8
Authors
Kanda, Naoyuki [1]
Wu, Jian [1]
Wu, Yu [2]
Xiao, Xiong [1]
Meng, Zhong [1]
Wang, Xiaofei [1]
Gaur, Yashesh [1]
Chen, Zhuo [1]
Li, Jinyu [1]
Yoshioka, Takuya [1]
Affiliations
[1] Microsoft Cloud AI, Redmond, WA 98052, USA
[2] Microsoft Research Asia, Beijing, China
Source
INTERSPEECH 2022, 2022
Keywords
multi-talker speech recognition; serialized output training; streaming inference; speech
DOI
10.21437/Interspeech.2022-7
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared with prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in experiments on the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
Pages: 3774-3778
Page count: 5
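
To make the serialization scheme described in the abstract concrete, the following minimal Python sketch shows how a reference transcription for two overlapping utterances could be flattened into a single t-SOT token stream: tokens from all speakers are ordered by emission time, and a special channel-change token is inserted whenever the virtual output channel switches. This is not the authors' implementation; the token spelling "<cc>", the TimedToken structure, and the example timings are illustrative assumptions.

# Minimal sketch (not the authors' code) of t-SOT serialization:
# merge tokens of overlapping utterances in emission-time order and
# insert a channel-change token whenever the virtual channel switches.
from dataclasses import dataclass

CHANNEL_CHANGE = "<cc>"  # assumed spelling of the channel-change token

@dataclass
class TimedToken:
    text: str       # recognition token (word or subword)
    time: float     # emission time in seconds
    channel: int    # virtual output channel (0 or 1 for up to two overlaps)

def serialize_t_sot(tokens: list[TimedToken]) -> list[str]:
    """Return a single chronological token stream with <cc> markers."""
    ordered = sorted(tokens, key=lambda t: t.time)
    output: list[str] = []
    prev_channel = None
    for tok in ordered:
        if prev_channel is not None and tok.channel != prev_channel:
            output.append(CHANNEL_CHANGE)
        output.append(tok.text)
        prev_channel = tok.channel
    return output

if __name__ == "__main__":
    # Two partially overlapping utterances: "hello world" and "good morning"
    utterance = [
        TimedToken("hello", 0.10, 0),
        TimedToken("good", 0.25, 1),
        TimedToken("world", 0.40, 0),
        TimedToken("morning", 0.55, 1),
    ]
    print(" ".join(serialize_t_sot(utterance)))
    # -> hello <cc> good <cc> world <cc> morning

In the paper's framework, a serialized sequence of this form serves as the training target for a streaming single-output ASR model (a transformer transducer in the reported experiments), so that overlapping speech can be handled without multiple output branches.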