DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors

Cited by: 2
Authors
Landini, Federico [1]
Diez, Mireia [1]
Stafylakis, Themos [2, 3]
Burget, Lukas [1]
Affiliations
[1] Brno Univ Technol, Brno 61266, Czech Republic
[2] Omilia, Maroussi 15126, Greece
[3] Athens Univ Econ & Business, Athina 10434, Greece
Funding
EU Horizon 2020; US National Science Foundation
Keywords
Decoding; Long short term memory; Biological system modeling; Vectors; Oral communication; Data models; Training; Attractor; DiaPer; end-to-end neural diarization; perceiver; speaker diarization;
DOI
10.1109/TASLP.2024.3422818
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and faster inference time. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
Pages: 3450-3465
Page count: 16