Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

被引:2
|
作者
Guo, Pengcheng [1 ]
Chang, Xuankai [2 ]
Watanabe, Shinji [2 ]
Xie, Lei [1 ]
机构
[1] Northwestern Polytech Univ, Sch Comp Sci, ASLP NPU, Xian, Peoples R China
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
来源
INTERSPEECH 2021 | 2021年
关键词
Non-autoregressive; conditional chain model; multi-speaker speech recognition; SEPARATION;
D O I
10.21437/Interspeech.2021-2155
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
Non-autoregressive (NAR) models have achieved a large inference computation reduction and comparable results with autoregressive (AR) models on various sequence to sequence tasks. However, there has been limited research aiming to explore the NAR approaches on sequence to multi-sequence problems, like multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by-one using both the input mixture speech and previously-estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total inference steps will be restricted to the number of mixed speakers. Besides, we also adopt the Conformer and incorporate an intermediate CTC loss to improve the performance. Experiments on WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase of latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including the data of variable numbers of speakers, our model can even better than the PIT-Conformer AR model with only 1/7 latency, obtaining WERs of 19.9% and 34.3% on WSJ0-2mix and WSJ0-3mix sets. All of our codes are publicly available at hfips://github.com/pengchengguo/espnet/tree/conditional-multispk.
引用
收藏
页码:3720 / 3724
页数:5
相关论文
共 28 条
  • [1] LIGHTSPEECH: LIGHTWEIGHT NON-AUTOREGRESSIVE MULTI-SPEAKER TEXT-TO-SPEECH
    Li, Song
    Ouyang, Beibei
    Li, Lin
    Hong, Qingyang
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 499 - 506
  • [2] Speaker conditioned acoustic modeling for multi-speaker conversational ASR
    Chetupalli, Srikanth Raj
    Ganapathy, Sriram
    INTERSPEECH 2022, 2022, : 3834 - 3838
  • [3] STREAMING MULTI-SPEAKER ASR WITH RNN-T
    Sklyar, Ilya
    Piunova, Anna
    Liu, Yulan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6903 - 6907
  • [4] NON-AUTOREGRESSIVE TRANSFORMER ASR WITH CTC-ENHANCED DECODER INPUT
    Song, Xingchen
    Wu, Zhiyong
    Huang, Yiheng
    Weng, Chao
    Su, Dan
    Meng, Helen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5894 - 5898
  • [5] Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
    Higuchi, Yosuke
    Watanabe, Shinji
    Chen, Nanxin
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    INTERSPEECH 2020, 2020, : 3655 - 3659
  • [6] END-TO-END MULTI-SPEAKER ASR WITH INDEPENDENT VECTOR ANALYSIS
    Scheibler, Robin
    Zhang, Wangyou
    Chang, Xuankai
    Watanabe, Shinji
    Qian, Yanmin
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 496 - 501
  • [7] IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR
    Higuchi, Yosuke
    Inaguma, Hirofumi
    Watanabe, Shinji
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8363 - 8367
  • [8] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [9] EXTENDED GRAPH TEMPORAL CLASSIFICATION FOR MULTI-SPEAKER END-TO-END ASR
    Chang, Xuankai
    Moritz, Niko
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7322 - 7326
  • [10] END-TO-END MONAURAL MULTI-SPEAKER ASR SYSTEM WITHOUT PRETRAINING
    Chang, Xuankai
    Qian, Yanmin
    Yu, Kai
    Watanabe, Shinji
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6256 - 6260