Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Cited by: 8
Authors
Kanda, Naoyuki [1]
Wu, Jian [1]
Wu, Yu [2]
Xiao, Xiong [1]
Meng, Zhong [1]
Wang, Xiaofei [1]
Gaur, Yashesh [1]
Chen, Zhuo [1]
Li, Jinyu [1]
Yoshioka, Takuya [1]
Affiliations
[1] Microsoft Cloud AI, Redmond, WA 98052, USA
[2] Microsoft Research Asia, Beijing, China
Source
INTERSPEECH 2022, 2022
Keywords
multi-talker speech recognition; serialized output training; streaming inference
DOI
10.21437/Interspeech.2022-7
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates a change of the "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
Pages: 3774-3778
Page count: 5
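
The abstract describes serializing the recognition tokens of overlapping speakers into a single output stream, inserting a special token whenever the "virtual" output channel changes. The Python sketch below illustrates that serialization step under simplifying assumptions: the token string "<cc>", the function name serialize_tsot, and the direct one-speaker-per-channel mapping are illustrative choices, not details taken from the paper.

from typing import List, Tuple

CC = "<cc>"  # channel-change token; the exact name is assumed for illustration

def serialize_tsot(utterances: List[List[Tuple[float, str]]]) -> List[str]:
    """Merge per-speaker (emission_time, token) lists into one t-SOT-style
    token stream, assuming at most two overlapping speakers and that each
    speaker maps directly to one "virtual" output channel."""
    # Flatten to (time, speaker_index, token) triples and sort by emission time.
    timed = sorted(
        (time, spk, tok)
        for spk, utt in enumerate(utterances)
        for time, tok in utt
    )
    serialized: List[str] = []
    prev_spk = None
    for _, spk, tok in timed:
        if prev_spk is not None and spk != prev_spk:
            serialized.append(CC)  # the virtual output channel changes here
        serialized.append(tok)
        prev_spk = spk
    return serialized

if __name__ == "__main__":
    spk0 = [(0.0, "hello"), (0.4, "how"), (0.8, "are"), (1.2, "you")]
    spk1 = [(0.6, "good"), (1.0, "morning")]
    print(serialize_tsot([spk0, spk1]))
    # ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']

Note that this sketch hard-codes a one-speaker-per-channel mapping for brevity; in a full t-SOT setup the virtual channels are assigned so that overlapping utterances never share a channel, which is what lets a small fixed number of channels cover arbitrary speaker pairs.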