Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Cited by: 8
Authors
Kanda, Naoyuki [1]
Wu, Jian [1]
Wu, Yu [2]
Xiao, Xiong [1]
Meng, Zhong [1]
Wang, Xiaofei [1]
Gaur, Yashesh [1]
Chen, Zhuo [1]
Li, Jinyu [1]
Yoshioka, Takuya [1]
Affiliations
[1] Microsoft Cloud AI, Redmond, WA 98052, USA
[2] Microsoft Research Asia, Beijing, China
Source
INTERSPEECH 2022, 2022
Keywords
multi-talker speech recognition; serialized output training; streaming inference
DOI
10.21437/Interspeech.2022-7
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates a change of the "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
Pages: 3774-3778
Page count: 5
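
The abstract describes serializing the recognition tokens of overlapping speakers into a single output stream, inserting a special token whenever the "virtual" output channel changes. The Python sketch below illustrates that serialization step under simplifying assumptions: the token string "<cc>", the function name serialize_tsot, and the direct one-speaker-per-channel mapping are illustrative choices, not details taken from the paper.

from typing import List, Tuple

CC = "<cc>"  # channel-change token; the exact name is assumed for illustration

def serialize_tsot(utterances: List[List[Tuple[float, str]]]) -> List[str]:
    """Merge per-speaker (emission_time, token) lists into one t-SOT-style
    token stream, assuming at most two overlapping speakers and that each
    speaker maps directly to one "virtual" output channel."""
    # Flatten to (time, speaker_index, token) triples and sort by emission time.
    timed = sorted(
        (time, spk, tok)
        for spk, utt in enumerate(utterances)
        for time, tok in utt
    )
    serialized: List[str] = []
    prev_spk = None
    for _, spk, tok in timed:
        if prev_spk is not None and spk != prev_spk:
            serialized.append(CC)  # the virtual output channel changes here
        serialized.append(tok)
        prev_spk = spk
    return serialized

if __name__ == "__main__":
    spk0 = [(0.0, "hello"), (0.4, "how"), (0.8, "are"), (1.2, "you")]
    spk1 = [(0.6, "good"), (1.0, "morning")]
    print(serialize_tsot([spk0, spk1]))
    # ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']

Note that this sketch hard-codes a one-speaker-per-channel mapping for brevity; in a full t-SOT setup the virtual channels are assigned so that overlapping utterances never share a channel, which is what lets a small fixed number of channels cover arbitrary speaker pairs.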