Online Neural Speaker Diarization With Target Speaker Tracking

Cited by: 0

Authors:

Wang, Weiqing [1]

Li, Ming [1,2]

Affiliations:

[1] Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA

[2] Duke Kunshan Univ, Suzhou Municipal Key Lab Multimodal Intelligent Sy, Kunshan 215306, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

National Natural Science Foundation of China;

Keywords:

Voice activity detection; Clustering algorithms; Acoustics; Real-time systems; Vectors; Speech enhancement; Training; Target tracking; Low latency communication; Automatic speech recognition; Speaker diarization; online speaker diarization; target speaker voice activity detection; SPEECH; RECOGNITION; IDENTIFICATION; SEPARATION; VOXCELEB; NET;

DOI:

10.1109/TASLP.2024.3507559

Chinese Library Classification number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.
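The block-by-block inference loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the neural front end and the TS-VAD back end are replaced by stand-ins (each block is assumed to arrive already as a list of frame-level embeddings, and activity is decided by a cosine-similarity threshold). The enrolment rule for new speakers, the function name `online_diarize`, and the parameters `threshold` and `max_speakers` are all hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def online_diarize(blocks, threshold=0.7, max_speakers=4):
    """Toy online loop (assumption-laden stand-in, not the paper's network).

    `blocks` is a list of blocks; each block is a list of frame-level
    embeddings, standing in for the front-end output.
    Returns (per-block decisions, final target speaker embeddings).
    """
    targets = []   # running target speaker embeddings
    pooled = []    # frames aggregated so far for each target
    results = []   # per block: a list of active speaker indices per frame
    for frames in blocks:
        # 1) Predict activity using the previously estimated targets.
        decisions = [
            [s for s, t in enumerate(targets) if cosine(f, t) >= threshold]
            for f in frames
        ]
        # 2) Hypothetical enrolment rule: frames that match no target
        #    seed a single new speaker slot, if one is still free.
        unclaimed = [i for i, d in enumerate(decisions) if not d]
        if unclaimed and len(targets) < max_speakers:
            new_id = len(targets)
            targets.append(mean([frames[i] for i in unclaimed]))
            pooled.append([])
            for i in unclaimed:
                decisions[i] = [new_id]
        # 3) Update targets by aggregating the frames assigned to each
        #    speaker according to the current block's predictions.
        for f, d in zip(frames, decisions):
            for s in d:
                pooled[s].append(f)
        targets = [mean(p) if p else t for p, t in zip(pooled, targets)]
        results.append(decisions)
    return results, targets
```

Calling `online_diarize` on a short list of blocks returns the per-frame speaker indices for each block together with the final target embeddings. The ordering of steps 1 and 3 mirrors the abstract: predictions for a block depend only on embeddings estimated from earlier blocks, and the update happens afterwards.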

Pages: 5078-5091

Page count: 14

50 records in total

[1] Online Target Speaker Voice Activity Detection for Speaker Diarization
Wang, Weiqing
Lin, Qingjian
Li, Ming
INTERSPEECH 2022, 2022, : 1441 - 1445
[2] Speaker-Corrupted Embeddings for Online Speaker Diarization
Ghahabi, Omid
Fischer, Volker
INTERSPEECH 2019, 2019, : 386 - 390
[3] Online Neural Speaker Diarization with Spectral Clustering for Meeting Scenarios
Cheng, Tianyou
He, Maokui
Yang, Gaobin
Niu, Shutong
Lei, Yanqiang
Peng, Limei
Du, Jun
2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 373 - 377
[4] A Hybrid Approach to Online Speaker Diarization
Vaquero, Carlos
Vinyals, Oriol
Friedland, Gerald
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2646 - +
[5] Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech
Zajic, Zbynek
Zelinka, Jan
Mueller, Ludek
SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 555 - 563
[6] Ideas for Clustering of Similar Models of a Speaker in an Online Speaker Diarization System
Kunesova, Marie
Radova, Vlasta
TEXT, SPEECH, AND DIALOGUE (TSD 2015), 2015, 9302 : 225 - 233
[7] End-to-End Neural Speaker Diarization with Absolute Speaker Loss
Wang, Chao
Li, Jie
Fang, Xiang
Kang, Jian
Li, Yongxiang
INTERSPEECH 2023, 2023, : 3577 - 3581
[8] Experiments with Segmentation in an Online Speaker Diarization System
Kunesova, Marie
Zajic, Zbynek
Radova, Vlasta
TEXT, SPEECH, AND DIALOGUE, TSD 2017, 2017, 10415 : 429 - 437
[9] Online Meeting Recognizer with Multichannel Speaker Diarization
Araki, Shoko
Hori, Takaaki
Fujimoto, Masakiyo
Watanabe, Shinji
Yoshioka, Takuya
Nakatani, Tomohiro
Nakamura, Atsushi
2010 CONFERENCE RECORD OF THE FORTY FOURTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS (ASILOMAR), 2010, : 1697 - 1701
[10] ADAPTIVE AND ONLINE SPEAKER DIARIZATION FOR MEETING DATA
Soldi, Giovanni
Beaugeant, Christophe
Evans, Nicholas
2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 2112 - 2116