End-to-End Speaker-Attributed ASR with Transformer

Cited by: 11

Authors:

Kanda, Naoyuki [1]

Ye, Guoli [1]

Gaur, Yashesh [1]

Wang, Xiaofei [1]

Meng, Zhong [1]

Chen, Zhuo [1]

Yoshioka, Takuya [1]

Affiliations:

[1] Microsoft Corp, Redmond, WA 98052 USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; speech recognition; diarization

DOI:

10.21437/Interspeech.2021-101

Chinese Library Classification (CLC):

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture, which was previously based on a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures. Secondly, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are state-of-the-art results for LibriCSS in the monaural setting.


Pages: 4413-4417

Number of pages: 5
