Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps

Citations: 0
Authors
Choi, Jeong-Hwan [1 ]
Yang, Joon-Young [1 ]
Chang, Joon-Hyuk [1 ]
Affiliations
[1] Hanyang Univ, Sch Elect, Seoul 04763, South Korea
Keywords
Automatic speaker verification; knowledge distillation; lightweight model; speaker embedding extractor; ARCHITECTURE;
DOI
10.1109/TASLP.2024.3463491
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical deployment of automatic speaker verification (ASV) systems. To this end, we recently introduced broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within a hybrid CNN-Transformer architecture to keep the number of model parameters small. We proposed three BC-CMT-based SEEs of different sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend the previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational cost and latency of the BC-CMT, its two-dimensional (2D) SA operations, which compute SA maps over the frequency-time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to use a smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used as KD targets to guide those of the smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps of different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model, with 271.44K parameters, achieved 36.8% and 9.3% reductions in floating-point operations (on 1-s signals) and in equal error rate (EER) on the VoxCeleb1 test set, respectively, compared with the conventional BC-CMT-Tiny. For signals of 1 to 10 s, the CPU and GPU running times of the proposed BC-CMT-Tiny were 29.07 to 146.32 ms and 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% while improving the attention capability.
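To make the 1D simplification concrete, the following is a minimal PyTorch sketch of temporal-only self-attention as described in the abstract: the frequency axis is pooled away before the query/key/value projections, so the attention map has shape (batch, heads, time, time) regardless of the number of frequency bins or channels. This is an illustrative reading of the abstract, not the authors' exact BC-CMT block; the module name, mean pooling, and the BRL-style broadcast residual at the end are assumptions.

import torch
import torch.nn as nn


class TemporalSelfAttention(nn.Module):
    """Hypothetical 1-D self-attention over the time axis only.

    Sketch of the idea in the abstract: frequency is mean-pooled away
    before computing queries/keys/values, so the attention map size is
    independent of frequency bins and convolution channels.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        self.qkv = nn.Linear(channels, 3 * channels, bias=False)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        # Collapse frequency so attention depends on time only.
        feats = x.mean(dim=2).transpose(1, 2)           # (b, t, c)
        q, k, v = self.qkv(feats).chunk(3, dim=-1)      # each (b, t, c)

        def split(z: torch.Tensor) -> torch.Tensor:
            # (b, t, c) -> (b, heads, t, head_dim)
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                     # (b, heads, t, t)
        out = (attn @ v).transpose(1, 2).reshape(b, t, c)
        out = self.proj(out).transpose(1, 2)            # (b, c, t)
        # Broadcast the temporal summary back over frequency
        # (BRL-style residual, assumed here for illustration).
        return x + out.unsqueeze(2)

Relative to 2D frequency-time attention, the attention matrix here shrinks from (F*T) x (F*T) to T x T, which is the source of the reported reduction in floating-point operations.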
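The shape-independence argument for the KD step also lends itself to a short sketch: because the temporal attention maps of teacher and student are both (batch, heads, T, T), a row-wise KL divergence can align them even when the two models differ in channel width. The function below is a hypothetical illustration, with head averaging as an assumed way to reconcile differing head counts; the paper's actual distillation loss may differ.

import torch


def attention_map_kd_loss(student_attn: torch.Tensor,
                          teacher_attn: torch.Tensor) -> torch.Tensor:
    """Row-wise KL(teacher || student) between temporal SA maps.

    student_attn, teacher_attn: (batch, heads, T, T) softmax outputs.
    Head counts may differ between models, so heads are averaged out
    (an assumption of this sketch, not necessarily the paper's choice).
    """
    eps = 1e-8
    s = student_attn.mean(dim=1).clamp_min(eps)          # (batch, T, T)
    t = teacher_attn.detach().mean(dim=1).clamp_min(eps)
    # KL over the key axis of each query row, averaged over the batch.
    return (t * (t.log() - s.log())).sum(dim=-1).mean()

In training, such a term would typically be added to the speaker-classification objective with a weighting factor, guiding the student's attention toward the pretrained teacher's.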
Pages: 4580-4595
Page count: 16
Related Papers
6 records
  • [1] CNN-TRANSFORMER WITH SELF-ATTENTION NETWORK FOR SOUND EVENT DETECTION
    Wakayama, Keigo
    Saito, Shoichiro
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 806-810
  • [2] SEFormer: A Lightweight CNN-Transformer Based on Separable Multiscale Depthwise Convolution and Efficient Self-Attention for Rotating Machinery Fault Diagnosis
    Wang, Hongxing
    Ju, Xilai
    Zhu, Hua
    Li, Huafeng
    CMC-COMPUTERS MATERIALS & CONTINUA, 2025, 82 (01): 1417-1437
  • [3] Class token and knowledge distillation for multi-head self-attention speaker verification systems
    Mingote, Victoria
    Miguel, Antonio
    Ortega, Alfonso
    Lleida, Eduardo
    DIGITAL SIGNAL PROCESSING, 2023, 133
  • [4] Global-Local Self-Attention Based Transformer for Speaker Verification
    Xie, Fei
    Zhang, Dalong
    Liu, Chengming
    APPLIED SCIENCES-BASEL, 2022, 12 (19)
  • [5] Sleep-CMKD: Self-Attention CNN/Transformer Cross-Model Knowledge Distillation for Automatic Sleep Staging
    Kim, Hyounggyu
    Kim, Moogyeong
    Chung, Wonzoo
    2023 11TH INTERNATIONAL WINTER CONFERENCE ON BRAIN-COMPUTER INTERFACE (BCI), 2023
  • [6] Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization
    Jeoung, Ye-Rin
    Choi, Jeong-Hwan
    Seong, Ju-Seok
    Kyung, JeHyun
    Chang, Joon-Hyuk
    INTERSPEECH 2023, 2023: 3197-3201