Multi-level attention network: Mixed time-frequency channel attention and multi-scale self-attentive standard deviation pooling for speaker recognition

Cited by: 3
Authors
Deng, Lihong [1 ]
Deng, Fei [2 ]
Zhou, Kepeng [2 ]
Jiang, Peifan [2 ]
Zhang, Gexiang [3 ]
Yang, Qiang [3 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Peoples R China
[2] Chengdu Univ Technol, Coll Comp Sci & Cyber Secur, Chengdu, Peoples R China
[3] Chengdu Univ Informat Technol, Sch Automat, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speaker recognition; Attention mechanism; Aggregation method; Multi-level attention; Architecture;
DOI
10.1016/j.engappai.2023.107439
Chinese Library Classification
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
In this paper, we propose a more efficient lightweight speaker recognition network, the multi-level attention network (MANet). MANet aims to generate more robust and discriminative speaker features by emphasizing features at different levels of the speaker recognition network through multi-level attention. The multi-level attention comprises mixed time-frequency channel (MTFC) attention and multi-scale self-attentive standard deviation pooling (MSSDP). MTFC attention combines channel, time, and frequency information to capture global features and model long-term context. MSSDP captures changes in frame-level features and aggregates frame-level features at different scales, generating a long-term, robust, and discriminative utterance-level feature. MANet therefore emphasizes features at different levels. We performed extensive experiments on two popular datasets, VoxCeleb and CN-Celeb, comparing the proposed method with current state-of-the-art speaker recognition methods. It achieved EER/minDCF of 1.82%/0.1965, 1.94%/0.2059, 3.69%/0.3626, and 11.98%/0.4814 on the VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and CN-Celeb test sets, respectively. MANet is an effective lightweight speaker recognition network that outperforms most large speaker recognition networks and all lightweight speaker recognition networks tested, with a 64% performance improvement over the baseline system ThinResNet-34. Compared to the lightest network, EfficientTDNN-Small, it has only 0.6 million more parameters yet performs 63% better, and its performance differs by only 4% from that of the state-of-the-art large model LE-Conformer. In the ablation experiments, our proposed attention method and aggregation model achieved the best performance on VoxCeleb1-O, with EER/minDCF of 2.46%/0.2708 and 2.39%/0.2417, respectively, indicating that both are significant improvements over previous methods.
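The record describes MTFC attention and MSSDP only at a high level, so the two PyTorch sketches below are illustrative assumptions rather than the authors' implementation: the module names, pooling choices, reduction ratio, and bottleneck width are all hypothetical, and the multi-scale aspect of MSSDP is omitted for brevity. The first sketch shows one plausible reading of the abstract's description of MTFC attention, re-weighting a (batch, channels, frequency, time) feature map jointly along its channel, frequency, and time axes; the second shows the aggregation idea behind MSSDP, where a small attention head scores each frame and the attention-weighted mean and standard deviation are concatenated into a single utterance-level vector, capturing the frame-level variation the abstract refers to.

    import torch
    import torch.nn as nn

    class MixedTimeFreqChannelAttention(nn.Module):
        # Gates a (batch, channels, freq, time) feature map along its channel,
        # frequency, and time axes; a simplified, hypothetical reading of
        # "mixed time-frequency channel" attention.
        def __init__(self, channels: int, reduction: int = 8):
            super().__init__()
            self.channel_fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )
            self.freq_conv = nn.Conv1d(channels, 1, kernel_size=1)
            self.time_conv = nn.Conv1d(channels, 1, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, f, t = x.shape
            # Channel gate from global average pooling over (freq, time).
            ch = torch.sigmoid(self.channel_fc(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
            # Frequency gate from features averaged over time.
            fr = torch.sigmoid(self.freq_conv(x.mean(dim=3))).view(b, 1, f, 1)
            # Time gate from features averaged over frequency.
            tm = torch.sigmoid(self.time_conv(x.mean(dim=2))).view(b, 1, 1, t)
            return x * ch * fr * tm

    class SelfAttentiveStdPooling(nn.Module):
        # Aggregates frame-level features (batch, channels, frames) into one
        # utterance-level vector via an attention-weighted mean and standard
        # deviation; a single-scale stand-in for the paper's MSSDP.
        def __init__(self, channels: int, bottleneck: int = 128):
            super().__init__()
            self.attention = nn.Sequential(
                nn.Conv1d(channels, bottleneck, kernel_size=1),
                nn.Tanh(),
                nn.Conv1d(bottleneck, channels, kernel_size=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = torch.softmax(self.attention(x), dim=2)      # weights over frames
            mean = torch.sum(w * x, dim=2)                   # weighted mean
            var = torch.sum(w * x * x, dim=2) - mean * mean  # weighted variance
            std = torch.sqrt(var.clamp(min=1e-8))            # weighted std dev
            return torch.cat([mean, std], dim=1)             # (batch, 2*channels)

    if __name__ == "__main__":
        feats = torch.randn(4, 32, 80, 200)    # batch, channels, freq, frames
        gated = MixedTimeFreqChannelAttention(channels=32)(feats)
        pooled = SelfAttentiveStdPooling(channels=32)(gated.mean(dim=2))
        print(pooled.shape)                    # torch.Size([4, 64])

In a full system, the attention module would sit inside the convolutional trunk and the pooling layer would replace temporal average pooling before the speaker embedding layer; the toy shapes above are chosen only to make the sketch self-contained.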
Pages: 14