Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding

Cited: 0
Authors
Kim, Minsoo [1 ]
Jang, Gil-Jin [1 ]
Affiliations
[1] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 18
Keywords
speech recognition; speaker embedding; speaker-attributed training; SEPARATION;
DOI
10.3390/app14188138
Chinese Library Classification
O6 [Chemistry];
Subject Classification
0703 ;
Abstract
Featured Application: Speech recognition; speaker adaptation; speaker diarization.

Automatic speech recognition (ASR) aims to convert naturally spoken human speech into text inputs for machines. In multi-speaker environments, where several speakers talk simultaneously with substantial overlap, conventional ASR systems trained on single-talker recordings may degrade significantly. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. An embedding vector is extracted for each speaker in the training set, and all of the embedding vectors are stacked to construct a speaker profile matrix. At test time, this profile matrix allows embedding vectors close to the speakers of an input recording to be retrieved, which helps the system recognize the individual voices mixed in the input. Furthermore, the proposed method efficiently reuses the decoder of an existing speaker-independent ASR model, eliminating the need to retrain the entire system. Several speaker embedding methods, including i-vector, d-vector, and x-vector, were adopted, and the experimental results show 0.33% and 0.95% absolute (3.9% and 11.5% relative) word error rate (WER) improvements without and with the speaker profile, respectively.
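The speaker-profile mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one fixed-dimensional embedding (e.g., a d-vector or x-vector) per training speaker and uses cosine similarity for the nearest-embedding lookup; the function names and toy vectors are invented for this example.

```python
import numpy as np

def build_profile_matrix(embeddings):
    """Stack per-speaker embedding vectors into a profile matrix
    of shape (num_speakers, embedding_dim)."""
    return np.stack(embeddings, axis=0)

def closest_speakers(profile, query, top_k=2):
    """Return indices of the top_k profile rows with the highest
    cosine similarity to the query embedding."""
    p = profile / np.linalg.norm(profile, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = p @ q                      # cosine similarity per training speaker
    return np.argsort(sims)[::-1][:top_k]

# Toy example: four training speakers with 3-dimensional embeddings.
profile = build_profile_matrix([
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
    np.array([0.7, 0.7, 0.0]),
])
# Embedding estimated from one speaker in a test mixture.
query = np.array([0.9, 0.1, 0.0])
print(closest_speakers(profile, query))  # → [0 3]
```

In the paper's setting, the retrieved rows would serve as the auxiliary speaker inputs to the ASR model; here the lookup is shown in isolation.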
Pages: 17