DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors

Cited by: 2
Authors
Landini, Federico [1 ]
Diez, Mireia [1 ]
Stafylakis, Themos [2 ,3 ]
Burget, Lukas [1 ]
Affiliations
[1] Brno Univ Technol, Brno 61266, Czech Republic
[2] Omilia, Maroussi 15126, Greece
[3] Athens Univ Econ & Business, Athina 10434, Greece
Funding
EU Horizon 2020; U.S. National Science Foundation;
Keywords
Decoding; Long short term memory; Biological system modeling; Vectors; Oral communication; Data models; Training; Attractor; DiaPer; end-to-end neural diarization; perceiver; speaker diarization;
DOI
10.1109/TASLP.2024.3422818
CLC number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have lately gained great popularity. One of the most successful models is end-to-end neural diarization with encoder-decoder-based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and faster inference. Furthermore, when compared exhaustively with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. We also compare it against other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
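The core Perceiver-attractor idea described in the abstract — a fixed set of learned latent queries that cross-attend to the sequence of frame embeddings to produce speaker attractors — can be sketched roughly as follows. This is an illustrative approximation, not DiaPer's actual architecture: the single cross-attention layer with one head, the NumPy formulation, and all names (`perceiver_attractors`, the projections `W_q`/`W_k`/`W_v`, the sigmoid activity readout) and dimensions are assumptions made for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_attractors(frames, latents, W_q, W_k, W_v):
    """One cross-attention step: latent queries attend to frame embeddings.

    frames:  (T, D) per-frame embeddings from the encoder
    latents: (A, D) learned latent queries, one per potential attractor
    Returns (A, D) attractors, each a frame-weighted summary of one speaker.
    """
    Q = latents @ W_q            # (A, D) queries from the latents
    K = frames @ W_k             # (T, D) keys from the frames
    V = frames @ W_v             # (T, D) values from the frames
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (A, T) attention logits
    return softmax(scores, axis=-1) @ V       # (A, D) attractors

rng = np.random.default_rng(0)
T, D, A = 200, 32, 4   # frames, embedding dim, latent (attractor) count
frames = rng.standard_normal((T, D))
latents = rng.standard_normal((A, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

attractors = perceiver_attractors(frames, latents, W_q, W_k, W_v)
# EEND-style readout: per-frame, per-speaker activity posteriors via
# dot products between frame embeddings and attractors, then a sigmoid.
activities = 1.0 / (1.0 + np.exp(-(frames @ attractors.T)))

print(attractors.shape)  # (4, 32)
print(activities.shape)  # (200, 4)
```

Because the number of latent queries is fixed rather than produced sequentially by an LSTM decoder (as in EDA), the attractors can be computed in parallel, which is consistent with the faster inference the abstract reports.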
Pages: 3450-3465
Page count: 16