Class token and knowledge distillation for multi-head self-attention speaker verification systems

Cited by: 6
Authors
Mingote, Victoria [1 ]
Miguel, Antonio [1 ]
Ortega, Alfonso [1 ]
Lleida, Eduardo [1 ]
Affiliations
[1] Univ Zaragoza, Aragon Inst Engn Res I3A, ViVoLab, Zaragoza, Spain
Keywords
Class token; Teacher-student learning; Distillation token; Speaker verification; Multi-head self-attention; Memory layers
DOI
10.1016/j.dsp.2022.103859
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communications Technology]
Subject Classification
0808; 0809
Abstract
This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called the Class token to replace the global average pooling mechanism for extracting the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a new sampling estimation of the class token. In this approach, the class token is obtained by sampling from a list of several trainable vectors. This strategy introduces uncertainty that helps the model generalize better than a single initialization, as shown in the experiments. Second, we have added a distilled representation token for training a teacher-student pair of networks following the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions of the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings. (c) 2022 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
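The core idea of the abstract — prepending a class token to the frame sequence before self-attention, optionally sampling that token from a pool of trainable vectors, and reading its output state instead of mean-pooling — can be illustrated with a minimal single-head sketch. This is not the authors' implementation; the dimensions, pool size, and single-head simplification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

d, T = 8, 10
frames = rng.normal(size=(T, d))      # stand-in for acoustic frame features

# Hypothetical pool of trainable class-token vectors; during training one
# would be sampled per step (the paper's "sampling estimation" idea).
token_pool = rng.normal(size=(4, d))
cls = token_pool[rng.integers(len(token_pool))]

# Concatenate the class token to the input before the attention layer.
x = np.vstack([cls[None, :], frames])         # shape (T + 1, d)

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)

# The output state of the class token replaces the pooled average embedding;
# it would feed the classifier head (and, in the KD setup, a second
# distillation token would mimic the teacher's predictions the same way).
cls_state = out[0]
print(cls_state.shape)
```

Because the class token attends over every frame, its output state is a learned, position-aware summary of the utterance, whereas global average pooling discards the temporal order that text-dependent SV relies on.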
Pages: 10