Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation

Cited by: 1

Authors:

Lyu, Ke-Ming [1]

Lyu, Ren-yuan [1]

Chang, Hsien-Tsung [1,2,3]

Affiliations:

[1] Chang Gung Univ, Comp Sci & Informat Engn, Taoyuan, Taiwan

[2] Chang Gung Mem Hosp, Phys Med & Rehabil, Taoyuan, Taiwan

[3] Chang Gung Univ, Bachelor Program Artificial Intelligence, Taoyuan, Taiwan

Source:

PEERJ COMPUTER SCIENCE | 2024 / Vol. 10

Keywords:

Automatic speech recognition; Speaker diarization; Real-time system; Incremental clustering

DOI:

10.7717/peerj-cs.1973

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI's Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system's ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.
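
The abstract reports WDER figures but this record does not define the metric. As a rough illustration only, the sketch below assumes the definition commonly used in the joint ASR and diarization literature: among words that the recognizer aligned to a reference word (correct or substituted), WDER is the fraction carrying the wrong speaker label; inserted and deleted words are left out. The function name `wder` and the toy speaker labels are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical helper: each tuple is (reference_speaker, hypothesis_speaker)
# for one word that was aligned between reference and ASR output
# (correctly recognized or substituted). Insertions/deletions are assumed
# to have been filtered out beforehand.
def wder(aligned_speakers: List[Tuple[str, str]]) -> float:
    if not aligned_speakers:
        raise ValueError("need at least one aligned word")
    # Count aligned words whose predicted speaker differs from the reference.
    wrong = sum(1 for ref, hyp in aligned_speakers if ref != hyp)
    return wrong / len(aligned_speakers)

# Toy example: four aligned words, one attributed to the wrong speaker.
pairs = [("A", "A"), ("A", "A"), ("B", "A"), ("B", "B")]
print(f"{wder(pairs):.2%}")  # prints 25.00%
```

Under this assumed definition, a WDER of 2.68% would mean that roughly 27 of every 1,000 aligned words were attributed to the wrong speaker.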

Pages: 19
