BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Times Cited: 1
Authors
Liang, Yuhao [1 ]
Yu, Fan [2 ]
Li, Yangze [1 ]
Guo, Pengcheng [1 ]
Zhang, Shiliang [2 ]
Chen, Qian [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Alibaba Grp, Speech Lab DAMO Acad, Hangzhou, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
automatic speech recognition; multi-talker; multi-task learning;
DOI
10.21437/Interspeech.2023-1521
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
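As a rough illustration of the serialization scheme described in the abstract, the following Python sketch shows how per-speaker reference transcriptions can be flattened into a single token stream separated by a special speaker-change token, and how a BA-SOT-style multi-task loss could be combined. This is a minimal sketch under stated assumptions: the token name "<sc>", the helper names, and the loss weights are illustrative placeholders, not details taken from the paper.

from typing import List

SPEAKER_CHANGE = "<sc>"  # hypothetical name for the special speaker-change token


def serialize_references(utterances: List[str]) -> List[str]:
    """Concatenate per-speaker transcriptions into one SOT-style token stream,
    inserting the speaker-change token at each boundary."""
    tokens: List[str] = []
    for i, utt in enumerate(utterances):
        if i > 0:
            tokens.append(SPEAKER_CHANGE)
        tokens.extend(utt.split())
    return tokens


def ba_sot_style_loss(att_loss: float, ctc_loss: float, scd_loss: float,
                      boundary_loss: float,
                      ctc_weight: float = 0.3, scd_weight: float = 0.1,
                      boundary_weight: float = 0.1) -> float:
    """Hypothetical weighted sum of the attention (decoder) loss, the CTC loss,
    the speaker change detection (SCD) loss, and the boundary constraint loss;
    the weights here are placeholders, not the paper's values."""
    return ((1.0 - ctc_weight) * att_loss
            + ctc_weight * ctc_loss
            + scd_weight * scd_loss
            + boundary_weight * boundary_loss)


if __name__ == "__main__":
    refs = ["hello how are you", "fine thanks", "good to hear"]
    print(serialize_references(refs))
    # ['hello', 'how', 'are', 'you', '<sc>', 'fine', 'thanks', '<sc>', 'good', 'to', 'hear']

The serialization mirrors the SOT idea of reducing multi-talker ASR to a single-output sequence task; the combined loss only gestures at how the auxiliary speaker-change and boundary terms enter the training objective.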
Pages: 3487-3491
Number of Pages: 5