Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Cited by: 8
Authors
Kanda, Naoyuki [1 ]
Wu, Jian [1 ]
Wu, Yu [2 ]
Xiao, Xiong [1 ]
Meng, Zhong [1 ]
Wang, Xiaofei [1 ]
Gaur, Yashesh [1 ]
Chen, Zhuo [1 ]
Li, Jinyu [1 ]
Yoshioka, Takuya [1 ]
Affiliations
[1] Microsoft Cloud AI, Redmond, WA 98052 USA
[2] Microsoft Research Asia, Beijing, China
Source
INTERSPEECH 2022, 2022
Keywords
multi-talker speech recognition; serialized output training; streaming inference; SPEECH;
DOI
10.21437/Interspeech.2022-7
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline code
070206; 082403
Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model offers lower inference cost and a simpler model architecture. Moreover, in our experiments on the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
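A minimal sketch of the serialization scheme the abstract describes, for the case of up to two overlapping speakers: tokens from both speakers are merged into a single stream ordered by emission time, and a channel-change token is inserted whenever the virtual output channel switches; deserialization splits the stream back into per-channel transcripts by toggling the active channel at each such token. The token name "<cc>", the timestamps, and the helper functions below are illustrative assumptions based on the abstract, not the authors' implementation.

# Sketch of t-SOT-style serialization/deserialization for at most two
# concurrent utterances. Names and timestamps are illustrative only.
from dataclasses import dataclass
from typing import List

CC = "<cc>"  # assumed name of the special "channel change" token

@dataclass
class TimedToken:
    text: str       # word or subword token
    time: float     # emission time in seconds
    channel: int    # virtual output channel (0 or 1)

def serialize(tokens: List[TimedToken]) -> List[str]:
    """Merge per-speaker token streams into one stream sorted by emission
    time, inserting <cc> whenever the virtual channel switches."""
    ordered = sorted(tokens, key=lambda t: t.time)
    out, prev_ch = [], None
    for tok in ordered:
        if prev_ch is not None and tok.channel != prev_ch:
            out.append(CC)
        out.append(tok.text)
        prev_ch = tok.channel
    return out

def deserialize(stream: List[str]) -> List[List[str]]:
    """Split a serialized stream back into two virtual channels by toggling
    the active channel at every <cc> token (assumes at most two channels)."""
    channels, active = [[], []], 0
    for tok in stream:
        if tok == CC:
            active = 1 - active
        else:
            channels[active].append(tok)
    return channels

if __name__ == "__main__":
    # Two partially overlapping utterances with made-up emission times:
    # "hello world" on channel 0 and "good morning" on channel 1.
    toks = [
        TimedToken("hello", 0.2, 0),
        TimedToken("good", 0.5, 1),
        TimedToken("world", 0.7, 0),
        TimedToken("morning", 0.9, 1),
    ]
    s = serialize(toks)
    print(s)               # ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
    print(deserialize(s))  # [['hello', 'world'], ['good', 'morning']]

Because the serialized target is a single token stream, a standard single-output streaming model (e.g., a transformer transducer) can be trained on it without any architectural changes, which is the source of the lower inference cost and simpler architecture noted in the abstract.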
Pages: 3774-3778
Page count: 5