INVESTIGATION OF END-TO-END SPEAKER-ATTRIBUTED ASR FOR CONTINUOUS MULTI-TALKER RECORDINGS

Cited by: 21
Authors
Kanda, Naoyuki [1 ]
Chang, Xuankai [2 ,3 ]
Gaur, Yashesh [1 ]
Wang, Xiaofei [1 ]
Meng, Zhong [1 ]
Chen, Zhuo [1 ]
Yoshioka, Takuya [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] Microsoft, Redmond, WA USA
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
Rich transcription; speech recognition; speaker identification; speaker diarization; serialized output training; overlapped speech; diarization
DOI
10.1109/SLT48900.2021.9383600
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification for monaural overlapped speech. It showed promising results for simulated speech mixtures consisting of various numbers of speakers. However, the model required prior knowledge of speaker profiles to perform speaker identification, which significantly limited its applicability. In this paper, we extend the prior work by addressing the case where no speaker profile is available. Specifically, we perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model to diarize the utterances of speakers whose profiles are missing from the speaker inventory. We also propose a simple modification to the reference labels used in E2E SA-ASR training that helps the model handle continuous multi-talker recordings well. We conduct a comprehensive investigation of the original E2E SA-ASR and the proposed method on the monaural LibriCSS dataset. Compared to the original E2E SA-ASR with relevant speaker profiles, the proposed method achieves comparable performance without any prior speaker knowledge. We also show that the source-target attention in the E2E SA-ASR model provides information about the start and end times of the hypotheses.
Pages: 809-816
Page count: 8