Sequence-to-Sequence Neural Diarization With Automatic Speaker Detection and Representation

Cited by: 0
Authors
Cheng, Ming [1]
Lin, Yuke [1]
Li, Ming [1,2]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
[2] Duke Kunshan Univ, Digital Innovat Res Ctr, Suzhou Municipal Key Lab Multimodal Intelligent Syst, Kunshan 215316, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING | 2025, Vol. 33
Funding
National Natural Science Foundation of China;
Keywords
Voice activity detection; Adaptation models; Data mining; Decoding; Training; Real-time systems; Feature extraction; Network architecture; Neural networks; Low latency communication; Online speaker diarization; sequence-to-sequence neural diarization; speaker diarization; SYSTEM;
DOI
10.1109/TASLPRO.2025.3581032
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
This paper proposes a Sequence-to-Sequence Neural Diarization (S2SND) framework for both online and offline speaker diarization. Built upon a sequence-to-sequence architecture, S2SND integrates two novel components: (1) a masked speaker prediction mechanism that enables the model to detect unknown speakers without pre-extracted embeddings, and (2) a target-voice speaker embedding extraction module that infers speaker representations using predicted voice activities as reference. This joint modeling approach eliminates the need for unsupervised clustering or permutation-invariant training. During inference, S2SND processes long-form audio in continuous blocks, leveraging a speaker-embedding buffer to maintain speaker consistency and enable low-latency prediction. It supports real-time detection of new speakers and efficient rescoring for improved offline performance. The entire diarization network is trained end-to-end, with binary cross-entropy and ArcFace losses guiding the detection and representation branches, respectively. Experimental results demonstrate that S2SND achieves state-of-the-art diarization error rates (DERs) across multiple conditions. Specifically, it achieves DERs of 24.41% (online) and 21.95% (offline) on DIHARD-II, and 17.12% (online) and 15.13% (offline) on DIHARD-III, without using oracle voice activity detection.
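The abstract states that the detection branch is trained with binary cross-entropy and the representation branch with an ArcFace loss. The following is a minimal PyTorch sketch of such a joint objective, not the paper's released code: the tensor shapes, the scale and margin values, and the unweighted sum of the two terms are all assumptions for illustration.

```python
# Hypothetical sketch of the joint objective described in the abstract:
# BCE for frame-level speaker detection + ArcFace for speaker representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArcFaceLoss(nn.Module):
    """Additive angular margin softmax over a closed set of training speakers.

    scale/margin defaults are common choices, not values from the paper.
    """

    def __init__(self, embed_dim: int, num_speakers: int,
                 scale: float = 32.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_speakers, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale
        self.margin = margin

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-class angle.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)


# Toy forward pass with random tensors: batch of 8, 200 voice-activity
# frames per sample, 256-dim speaker embeddings, 100 training speakers.
va_logits = torch.randn(8, 200)                     # detection branch output
va_targets = torch.randint(0, 2, (8, 200)).float()  # frame-level speech labels
spk_emb = torch.randn(8, 256)                       # representation branch output
spk_labels = torch.randint(0, 100, (8,))            # training-speaker identities

detection_loss = F.binary_cross_entropy_with_logits(va_logits, va_targets)
representation_loss = ArcFaceLoss(256, 100)(spk_emb, spk_labels)
total_loss = detection_loss + representation_loss   # equal weighting is an assumption
```

The angular margin pushes embeddings of the same speaker toward their class direction and away from others, which is consistent with the abstract's claim that the representation branch yields speaker embeddings usable without unsupervised clustering.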
Pages: 2719-2734
Page count: 16