BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Times Cited: 1
Authors
Liang, Yuhao [1 ]
Yu, Fan [2 ]
Li, Yangze [1 ]
Guo, Pengcheng [1 ]
Zhang, Shiliang [2 ]
Chen, Qian [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Alibaba Grp, Speech Lab DAMO Acad, Hangzhou, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
automatic speech recognition; multi-talker; multi-task learning;
DOI
10.21437/Interspeech.2023-1521
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
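As a rough illustration of the serialization scheme described in the abstract, the following Python sketch shows how per-speaker reference transcriptions can be flattened into a single token stream separated by a special speaker-change token, and how a BA-SOT-style multi-task loss could be combined. This is a minimal sketch under stated assumptions: the token name "<sc>", the helper names, and the loss weights are illustrative placeholders, not details taken from the paper.

from typing import List

SPEAKER_CHANGE = "<sc>"  # hypothetical name for the special speaker-change token


def serialize_references(utterances: List[str]) -> List[str]:
    """Concatenate per-speaker transcriptions into one SOT-style token stream,
    inserting the speaker-change token at each boundary."""
    tokens: List[str] = []
    for i, utt in enumerate(utterances):
        if i > 0:
            tokens.append(SPEAKER_CHANGE)
        tokens.extend(utt.split())
    return tokens


def ba_sot_style_loss(att_loss: float, ctc_loss: float, scd_loss: float,
                      boundary_loss: float,
                      ctc_weight: float = 0.3, scd_weight: float = 0.1,
                      boundary_weight: float = 0.1) -> float:
    """Hypothetical weighted sum of the attention (decoder) loss, the CTC loss,
    the speaker change detection (SCD) loss, and the boundary constraint loss;
    the weights here are placeholders, not the paper's values."""
    return ((1.0 - ctc_weight) * att_loss
            + ctc_weight * ctc_loss
            + scd_weight * scd_loss
            + boundary_weight * boundary_loss)


if __name__ == "__main__":
    refs = ["hello how are you", "fine thanks", "good to hear"]
    print(serialize_references(refs))
    # ['hello', 'how', 'are', 'you', '<sc>', 'fine', 'thanks', '<sc>', 'good', 'to', 'hear']

The serialization mirrors the SOT idea of reducing multi-talker ASR to a single-output sequence task; the combined loss only gestures at how the auxiliary speaker-change and boundary terms enter the training objective.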
Pages: 3487-3491
Number of Pages: 5