Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

Cited by: 2

Authors:

Guo, Pengcheng [1]

Chang, Xuankai [2]

Watanabe, Shinji [2]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, ASLP NPU, Xian, Peoples R China

[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

Non-autoregressive; conditional chain model; multi-speaker speech recognition; SEPARATION;

DOI:

10.21437/Interspeech.2021-2155

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

Non-autoregressive (NAR) models have achieved a large reduction in inference computation and results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches on sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one by one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve performance. Experiments on the WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase in latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including data with variable numbers of speakers, our model can even perform better than the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets. All of our code is publicly available at https://github.com/pengchengguo/espnet/tree/conditional-multispk.
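
The inference procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `step_fn`, `toy_step`, the blank index 0, and the integer used as the condition are all hypothetical stand-ins for the Conformer CTC encoder and the conditional speaker features. The sketch only shows that the loop runs once per speaker and that each step's frame-level output is reduced with standard greedy CTC collapsing (merge repeated labels, then drop blanks).

```python
from typing import Callable, List, Optional, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Merge consecutive repeated labels, then drop blanks."""
    out: List[int] = []
    prev: Optional[int] = None
    for label in frame_ids:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


# Hypothetical step signature: (mixture, condition) -> (per-frame label ids, new condition).
# In a real system this would be one parallel (NAR) pass of a CTC encoder.
StepFn = Callable[[Sequence[float], Optional[int]], Tuple[Sequence[int], int]]


def conditional_chain_decode(
    mixture: Sequence[float], step_fn: StepFn, num_speakers: int
) -> List[List[int]]:
    """Run one step per speaker, feeding each step's condition to the next."""
    condition: Optional[int] = None  # no previous speaker at the first step
    hypotheses: List[List[int]] = []
    for _ in range(num_speakers):
        frame_ids, condition = step_fn(mixture, condition)
        hypotheses.append(ctc_greedy_collapse(frame_ids))
    return hypotheses


def toy_step(mixture: Sequence[float], condition: Optional[int]) -> Tuple[Sequence[int], int]:
    """Toy stand-in for the encoder: returns canned frame labels per step."""
    step = 0 if condition is None else condition + 1
    canned = [
        [0, 3, 3, 0, 5, 5, 5, 0],  # frames for the first speaker
        [7, 7, 0, 7, 0, 2, 2],     # frames for the second speaker
    ]
    return canned[step], step


if __name__ == "__main__":
    print(conditional_chain_decode([0.0, 0.1], toy_step, num_speakers=2))
```

Because every step is a single parallel pass, the number of sequential steps in this sketch equals `num_speakers`, independent of the length of each transcript.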

Pages: 3720-3724

Page count: 5

28 items in total

[1] LIGHTSPEECH: LIGHTWEIGHT NON-AUTOREGRESSIVE MULTI-SPEAKER TEXT-TO-SPEECH
Li, Song
Ouyang, Beibei
Li, Lin
Hong, Qingyang
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 499 - 506
[2] Speaker conditioned acoustic modeling for multi-speaker conversational ASR
Chetupalli, Srikanth Raj
Ganapathy, Sriram
INTERSPEECH 2022, 2022, : 3834 - 3838
[3] STREAMING MULTI-SPEAKER ASR WITH RNN-T
Sklyar, Ilya
Piunova, Anna
Liu, Yulan
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6903 - 6907
[4] NON-AUTOREGRESSIVE TRANSFORMER ASR WITH CTC-ENHANCED DECODER INPUT
Song, Xingchen
Wu, Zhiyong
Huang, Yiheng
Weng, Chao
Su, Dan
Meng, Helen
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5894 - 5898
[5] Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
Higuchi, Yosuke
Watanabe, Shinji
Chen, Nanxin
Ogawa, Tetsuji
Kobayashi, Tetsunori
INTERSPEECH 2020, 2020, : 3655 - 3659
[6] END-TO-END MULTI-SPEAKER ASR WITH INDEPENDENT VECTOR ANALYSIS
Scheibler, Robin
Zhang, Wangyou
Chang, Xuankai
Watanabe, Shinji
Qian, Yanmin
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 496 - 501
[7] IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR
Higuchi, Yosuke
Inaguma, Hirofumi
Watanabe, Shinji
Ogawa, Tetsuji
Kobayashi, Tetsunori
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8363 - 8367
[8] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
Rybicka, Magdalena
Villalba, Jesus
Thebaud, Thomas
Dehak, Najim
Kowalczyk, Konrad
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
[9] EXTENDED GRAPH TEMPORAL CLASSIFICATION FOR MULTI-SPEAKER END-TO-END ASR
Chang, Xuankai
Moritz, Niko
Hori, Takaaki
Watanabe, Shinji
Le Roux, Jonathan
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7322 - 7326
[10] END-TO-END MONAURAL MULTI-SPEAKER ASR SYSTEM WITHOUT PRETRAINING
Chang, Xuankai
Qian, Yanmin
Yu, Kai
Watanabe, Shinji
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6256 - 6260
