Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Cited by: 8
Authors
Kanda, Naoyuki [1]
Wu, Jian [1]
Wu, Yu [2]
Xiao, Xiong [1]
Meng, Zhong [1]
Wang, Xiaofei [1]
Gaur, Yashesh [1]
Chen, Zhuo [1]
Li, Jinyu [1]
Yoshioka, Takuya [1]
Affiliations
[1] Microsoft Cloud AI, Redmond, WA 98052, USA
[2] Microsoft Research Asia, Beijing, China
Source
INTERSPEECH 2022, 2022
Keywords
multi-talker speech recognition; serialized output training; streaming inference; speech
DOI
10.21437/Interspeech.2022-7
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared with prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in experiments on the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
Pages: 3774-3778
Page count: 5
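
To make the serialization scheme described in the abstract concrete, the following minimal Python sketch shows how a reference transcription for two overlapping utterances could be flattened into a single t-SOT token stream: tokens from all speakers are ordered by emission time, and a special channel-change token is inserted whenever the virtual output channel switches. This is not the authors' implementation; the token spelling "<cc>", the TimedToken structure, and the example timings are illustrative assumptions.

# Minimal sketch (not the authors' code) of t-SOT serialization:
# merge tokens of overlapping utterances in emission-time order and
# insert a channel-change token whenever the virtual channel switches.
from dataclasses import dataclass

CHANNEL_CHANGE = "<cc>"  # assumed spelling of the channel-change token

@dataclass
class TimedToken:
    text: str       # recognition token (word or subword)
    time: float     # emission time in seconds
    channel: int    # virtual output channel (0 or 1 for up to two overlaps)

def serialize_t_sot(tokens: list[TimedToken]) -> list[str]:
    """Return a single chronological token stream with <cc> markers."""
    ordered = sorted(tokens, key=lambda t: t.time)
    output: list[str] = []
    prev_channel = None
    for tok in ordered:
        if prev_channel is not None and tok.channel != prev_channel:
            output.append(CHANNEL_CHANGE)
        output.append(tok.text)
        prev_channel = tok.channel
    return output

if __name__ == "__main__":
    # Two partially overlapping utterances: "hello world" and "good morning"
    utterance = [
        TimedToken("hello", 0.10, 0),
        TimedToken("good", 0.25, 1),
        TimedToken("world", 0.40, 0),
        TimedToken("morning", 0.55, 1),
    ]
    print(" ".join(serialize_t_sot(utterance)))
    # -> hello <cc> good <cc> world <cc> morning

In the paper's framework, a serialized sequence of this form serves as the training target for a streaming single-output ASR model (a transformer transducer in the reported experiments), so that overlapping speech can be handled without multiple output branches.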