Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization

Cited by: 1
Authors
Jeoung, Ye-Rin [1]
Choi, Jeong-Hwan [1]
Seong, Ju-Seok [1]
Kyung, JeHyun [1]
Chang, Joon-Hyuk [1]
Affiliations
[1] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Source
INTERSPEECH 2023 | 2023
Keywords
speaker diarization; end-to-end neural diarization; self-attention mechanism; fine-tuning; self-distillation
DOI
10.21437/Interspeech.2023-1404
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this study, we explore self-distillation (SD) techniques to improve the performance of the transformer-encoder-based self-attentive (SA) end-to-end neural speaker diarization (EEND). We first apply the SD approaches, introduced in the automatic speech recognition field, to the SA-EEND model to confirm their potential for speaker diarization. Then, we propose two novel SD methods for the SA-EEND, which distill the prediction output of the model or the SA heads of the upper blocks into the SA heads of the lower blocks. Consequently, we expect the high-level speaker-discriminative knowledge learned by the upper blocks to be shared across the lower blocks, thereby enabling the SA heads of the lower blocks to effectively capture the discriminative patterns of overlapped speech of multiple speakers. Experimental results on the simulated and CALLHOME datasets show that the SD generally improves the baseline performance, and the proposed methods outperform the conventional SD approaches.
Pages: 3197-3201
Number of pages: 5