DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors

Cited by: 2

Authors:

Landini, Federico [1]

Diez, Mireia [1]

Stafylakis, Themos [2,3]

Burget, Lukas [1]

Affiliations:

[1] Brno Univ Technol, Brno 61266, Czech Republic

[2] Omilia, Maroussi 15126, Greece

[3] Athens Univ Econ & Business, Athina 10434, Greece

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

European Union Horizon 2020; U.S. National Science Foundation

Keywords:

Decoding; Long short term memory; Biological system modeling; Vectors; Oral communication; Data models; Training; Attractor; DiaPer; end-to-end neural diarization; perceiver; speaker diarization;

DOI:

10.1109/TASLP.2024.3422818

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and faster inference time. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.


Pages: 3450-3465

Number of pages: 16

