Multi-level attention network: Mixed time-frequency channel attention and multi-scale self-attentive standard deviation pooling for speaker recognition

Cited by: 3
Authors
Deng, Lihong [1 ]
Deng, Fei [2 ]
Zhou, Kepeng [2 ]
Jiang, Peifan [2 ]
Zhang, Gexiang [3 ]
Yang, Qiang [3 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Peoples R China
[2] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
[3] Chengdu Univ Informat Technol, Sch Automat, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speaker recognition; Attention mechanism; Aggregation method; Multi-level attention; ARCHITECTURE;
DOI
10.1016/j.engappai.2023.107439
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we propose a more efficient lightweight speaker recognition network, the multi-level attention network (MANet). MANet aims to generate more robust and discriminative speaker features by emphasizing features at different levels of the speaker recognition network through multi-level attention. The multi-level attention consists of mixed time-frequency channel (MTFC) attention and multi-scale self-attentive standard deviation pooling (MSSDP). MTFC attention combines channel, time, and frequency information to capture global features and model long-term context. MSSDP captures changes in frame-level features and aggregates frame-level features at different scales, generating a long-term, robust, and discriminative utterance-level feature. MANet therefore emphasizes features at different levels. We performed extensive experiments on two popular datasets, VoxCeleb and CN-Celeb, and compared the proposed method with current state-of-the-art speaker recognition methods. It achieved EER/minDCF of 1.82%/0.1965, 1.94%/0.2059, 3.69%/0.3626, and 11.98%/0.4814 on the VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and CN-Celeb test sets, respectively. MANet is an effective lightweight speaker recognition network, outperforming most large speaker recognition networks and all lightweight speaker recognition networks tested, with a 64% performance improvement over the baseline system ThinResNet-34. Compared with the lightest network, EfficientTDNN-Small, it has only 0.6 million more parameters but 63% better performance, and its performance is within 4% of the state-of-the-art large model LE-Conformer. In the ablation experiments, the proposed attention method and aggregation model achieved the best performance on VoxCeleb1-O, with EER/minDCF of 2.46%/0.2708 and 2.39%/0.2417, respectively, indicating that they are a significant improvement over previous methods.
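To make the aggregation idea above more concrete, the following is a minimal PyTorch sketch of self-attentive standard-deviation pooling applied at multiple scales, in the spirit of the MSSDP module described in the abstract. It is not the authors' implementation: the module names, the bottleneck attention design, and the choice of feature scales are illustrative assumptions based on the general form of attentive statistics pooling.

```python
# Hypothetical sketch of multi-scale self-attentive standard-deviation pooling.
# Not the paper's code; layer sizes and structure are illustrative assumptions.
import torch
import torch.nn as nn


class SelfAttentiveStdPool(nn.Module):
    """Pools frame-level features (batch, channels, frames) into one vector.

    Attention weights are predicted per frame; the weighted mean and the
    weighted standard deviation are concatenated, so the output reflects
    both the average and the variation of the frame-level features.
    """

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)      # per-frame weights
        mean = torch.sum(alpha * x, dim=2)                   # weighted mean
        var = torch.sum(alpha * x ** 2, dim=2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))                # weighted std captures frame-level variation
        return torch.cat([mean, std], dim=1)                 # (batch, 2 * channels)


class MultiScaleSelfAttentiveStdPool(nn.Module):
    """Applies the pooling above to feature maps taken at different scales
    (e.g. from different network stages) and concatenates the results into
    a single utterance-level embedding."""

    def __init__(self, channel_list):
        super().__init__()
        self.pools = nn.ModuleList([SelfAttentiveStdPool(c) for c in channel_list])

    def forward(self, feats):
        # feats: list of tensors, each (batch, C_i, T_i)
        return torch.cat([pool(f) for pool, f in zip(self.pools, feats)], dim=1)


if __name__ == "__main__":
    pool = MultiScaleSelfAttentiveStdPool([256, 512])
    f1 = torch.randn(4, 256, 200)   # earlier-stage features (finer temporal scale)
    f2 = torch.randn(4, 512, 100)   # later-stage features (coarser temporal scale)
    print(pool([f1, f2]).shape)     # torch.Size([4, 1536])
```

The weighted standard deviation is what lets the pooled vector capture changes across frames rather than only their average, which is the property the abstract attributes to MSSDP; the multi-scale concatenation is one plausible way to combine frame-level features of different resolutions.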
Pages: 14
Related papers
50 records in total
  • [1] Multi-Scale Time-Frequency Attention for Acoustic Event Detection
    Zhang, Jingyang
    Ding, Wenhao
    Kang, Jintao
    He, Liang
    INTERSPEECH 2019, 2019: 3855 - 3859
  • [2] Local climate zone classification using a multi-scale, multi-level attention network
    Kim, Minho
    Jeong, Doyoung
    Kim, Yongil
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2021, 181: 345 - 366
  • [3] MLANet: multi-level attention network with multi-scale feature fusion for crowd counting
    Xiong, Liyan
    Zeng, Yijuan
    Huang, Xiaohui
    Li, Zhida
    Huang, Peng
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (05): 6591 - 6608
  • [4] FREQUENCY AND MULTI-SCALE SELECTIVE KERNEL ATTENTION FOR SPEAKER VERIFICATION
    Mun, Sung Hwan
    Jung, Jee-Weon
    Han, Min Hyun
    Kim, Nam Soo
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022: 548 - 554
  • [5] Multi-scale and Multi-level Attention Based on External Knowledge in EHRs
    Le, Duc
    Le, Bac
    RECENT CHALLENGES IN INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2024, PT I, 2024, 2144 : 113 - 125
  • [6] Multi-level channel attention excitation network for human action recognition in videos
    Wu, Hanbo
    Ma, Xin
    Li, Yibin
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 114
  • [7] A Multi-Scale Channel Attention Network for Prostate Segmentation
    Ding, Meiwen
    Lin, Zhiping
    Lee, Chau Hung
    Tan, Cher Heng
    Huang, Weimin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2023, 70 (05) : 1754 - 1758
  • [8] A parallel multi-scale time-frequency block convolutional neural network based on channel attention module for motor imagery classification
    Li, Hongli
    Chen, Hongyu
    Jia, Ziyu
    Zhang, Ronghua
    Yin, Feichao
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 79
  • [9] A Deformable and Multi-Scale Network with Self-Attentive Feature Fusion for SAR Ship Classification
    Chen, Peng
    Zhou, Hui
    Li, Ying
    Liu, Bingxin
    Liu, Peng
    JOURNAL OF MARINE SCIENCE AND ENGINEERING, 2024, 12 (09)
  • [10] Speech Emotion Recognition via Multi-Level Attention Network
    Liu, Ke
    Wang, Dekui
    Wu, Dongya
    Liu, Yutao
    Feng, Jun
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2278 - 2282