DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors

Cited by: 2

Authors:

Landini, Federico [1]

Diez, Mireia [1]

Stafylakis, Themos [2,3]

Burget, Lukas [1]

Affiliations:

[1] Brno Univ Technol, Brno 61266, Czech Republic

[2] Omilia, Maroussi 15126, Greece

[3] Athens Univ Econ & Business, Athina 10434, Greece

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

European Union Horizon 2020; U.S. National Science Foundation

Keywords:

Decoding; Long short term memory; Biological system modeling; Vectors; Oral communication; Data models; Training; Attractor; DiaPer; end-to-end neural diarization; perceiver; speaker diarization;

DOI:

10.1109/TASLP.2024.3422818

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and faster inference time. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.


Pages: 3450-3465

Page count: 16

