Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation

Cited by: 1
Authors
Lyu, Ke-Ming [1]
Lyu, Ren-yuan [1]
Chang, Hsien-Tsung [1, 2, 3]
Affiliations
[1] Chang Gung Univ, Comp Sci & Informat Engn, Taoyuan, Taiwan
[2] Chang Gung Mem Hosp, Phys Med & Rehabil, Taoyuan, Taiwan
[3] Chang Gung Univ, Bachelor Program Artificial Intelligence, Taoyuan, Taiwan
Keywords
Automatic speech recognition; Speaker diarization; Real-time system; Incremental clustering
DOI
10.7717/peerj-cs.1973
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI's Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system's ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.
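The abstract reports WDER values but does not state how the metric is computed. As a minimal sketch, assuming the commonly used definition (among reference words that the ASR output aligns to, either correctly or as substitutions, the fraction that carry the wrong speaker label), the calculation could look like the following. The `AlignedWord` type and the `wder` function are hypothetical names introduced for illustration; the word alignment and the mapping of hypothesis speaker labels onto reference labels are assumed to have been done beforehand and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlignedWord:
    """One reference word aligned to one hypothesis word (hypothetical type).

    Only correct words and substitutions are represented; insertions and
    deletions have no aligned counterpart and are excluded under the
    assumed definition of WDER.
    """
    ref_word: str
    hyp_word: str
    ref_speaker: str
    hyp_speaker: str  # assumed to be already mapped onto reference labels


def wder(aligned: Sequence[AlignedWord]) -> float:
    """Return the fraction of aligned words whose speaker label is wrong."""
    if not aligned:
        raise ValueError("WDER is undefined for an empty alignment")
    wrong_speaker = sum(1 for w in aligned if w.ref_speaker != w.hyp_speaker)
    return wrong_speaker / len(aligned)


if __name__ == "__main__":
    sample = [
        AlignedWord("hello", "hello", "S1", "S1"),  # correct word, correct speaker
        AlignedWord("there", "their", "S1", "S1"),  # substitution, correct speaker
        AlignedWord("good", "good", "S2", "S1"),    # correct word, wrong speaker
        AlignedWord("morning", "morning", "S2", "S2"),
    ]
    print(f"WDER = {wder(sample):.2%}")
```

Under this definition a recognition error does not by itself count against WDER; only the speaker attribution of aligned words does, which is why the metric is usually reported alongside a word or character error rate.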
Pages: 19