Online Neural Speaker Diarization With Target Speaker Tracking

Authors
Wang, Weiqing [1]
Li, Ming [1, 2]
Affiliations
[1] Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA
[2] Duke Kunshan Univ, Suzhou Municipal Key Lab Multimodal Intelligent Sy, Kunshan 215306, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Voice activity detection; Clustering algorithms; Acoustics; Real-time systems; Vectors; Speech enhancement; Training; Target tracking; Low latency communication; Automatic speech recognition; Speaker diarization; online speaker diarization; target speaker voice activity detection; SPEECH; RECOGNITION; IDENTIFICATION; SEPARATION; VOXCELEB; NET;
DOI
10.1109/TASLP.2024.3507559
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.
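The block-by-block loop described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the neural front-end and the TS-VAD back-end are replaced here by stand-ins, namely precomputed frame-level embeddings and a cosine-similarity threshold. The names `process_block` and `online_diarize`, the `threshold` and `max_speakers` parameters, and the rule that enrolls a new speaker when no target matches are all assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def process_block(frames, targets, counts, threshold, max_speakers):
    """Predict per-frame active speakers, then update the target embeddings.

    Assumption: a frame that matches no existing target enrolls a new
    speaker (hypothetical rule standing in for the learned back-end).
    """
    decisions = []
    for f in frames:
        active = [s for s, t in enumerate(targets) if cosine(f, t) > threshold]
        if not active and len(targets) < max_speakers:
            targets.append(list(f))
            counts.append(0)
            active = [len(targets) - 1]
        decisions.append(active)

    # Aggregate this block's frames into a running mean per active speaker.
    for f, active in zip(frames, decisions):
        for s in active:
            counts[s] += 1
            targets[s] = [t + (x - t) / counts[s] for t, x in zip(targets[s], f)]
    return decisions


def online_diarize(blocks, threshold=0.7, max_speakers=4):
    """Run the toy loop over blocks of frame-level embeddings."""
    targets, counts, decisions = [], [], []
    for frames in blocks:
        decisions.extend(
            process_block(frames, targets, counts, threshold, max_speakers)
        )
    return decisions, targets


# Two blocks of 2-dimensional toy "embeddings", two frames each.
blocks = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.0]],
]
decisions, targets = online_diarize(blocks)
print(decisions)
```

Each entry of `decisions` lists the speaker indices judged active for one frame. The target embeddings carry over from block to block, which is what keeps speaker indices consistent across the signal in this sketch.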
Pages: 5078-5091
Page count: 14