Deep CNNs With Self-Attention for Speaker Identification

Cited by: 53
Authors
Nguyen Nang An [1 ]
Nguyen Quang Thanh [1 ]
Liu, Yanbing [2 ]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Dept Comp Sci & Technol, Chongqing 400065, Peoples R China
[2] Chongqing Univ Posts & Telecommun, Chongqing Engn Lab Internet & Informat Secur, Chongqing 400065, Peoples R China
Keywords
Speaker identification; deep neural networks; self-attention; embedding learning; SUPPORT VECTOR MACHINES; RECOGNITION; QUANTIZATION; ROBUSTNESS;
DOI
10.1109/ACCESS.2019.2917470
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline Code
0812;
Abstract
Most current work on speaker identification is based on i-vector methods; however, there is a marked shift from the traditional i-vector toward deep learning methods, especially convolutional neural networks (CNNs). Rather than designing features and a separate classification model, we address the problem by jointly learning the features and the recognition system with deep neural networks. Building on deep CNNs, this paper presents a novel text-independent speaker identification method. Specifically, it builds on two representative CNN families: the visual geometry group (VGG) nets and residual neural networks (ResNets). Unlike prior deep neural network-based speaker identification methods, which usually rely on temporal maximum or average pooling across all time steps to map variable-length utterances to a fixed-dimensional vector, this paper equips these two CNNs with a structured self-attention mechanism that learns a weighted average over all time steps. With a structured self-attention layer using multiple attention hops, the proposed deep CNN can not only handle variable-length segments but also learn speaker characteristics from different aspects of the input sequence. Experimental results on the speaker identification benchmark database VoxCeleb demonstrate the superiority of the proposed method over traditional i-vector-based methods and other strong CNN baselines. In addition, the results suggest that unknown speakers can be clustered using the activations of an upper layer of a pre-trained identification CNN as a speaker embedding vector.
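The structured self-attention pooling described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy version (random weights, random frame features) of the general technique, not the paper's exact architecture: each attention hop learns its own weight distribution over the T time steps, so utterances of any length map to a fixed-size embedding of r weighted averages.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structured_self_attention_pool(H, W1, W2):
    """Pool variable-length frame features H (shape (T, d)) to a fixed vector.

    A = softmax(W2 @ tanh(W1 @ H.T)) gives one attention distribution over
    the T time steps per hop; E = A @ H is the per-hop weighted average.
    """
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=-1)  # (r, T) attention weights
    E = A @ H                                     # (r, d) pooled vectors
    return E.reshape(-1)                          # (r * d,) fixed embedding

rng = np.random.default_rng(0)
d, da, r = 8, 16, 4            # feature dim, attention dim, attention hops
W1 = rng.standard_normal((da, d)) * 0.1          # hypothetical toy weights
W2 = rng.standard_normal((r, da)) * 0.1

# Two "utterances" of different lengths map to the same embedding size.
e_short = structured_self_attention_pool(rng.standard_normal((50, d)), W1, W2)
e_long = structured_self_attention_pool(rng.standard_normal((300, d)), W1, W2)
print(e_short.shape, e_long.shape)  # (32,) (32,)
```

In the paper's setting this pooling layer replaces the usual temporal max/average pooling on top of the VGG or ResNet frame-level features; multiple hops let different distributions attend to different aspects of the sequence.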
Pages: 85327-85337
Number of pages: 11
Related Papers (50 total)
  • [1] Speaker identification for household scenarios with self-attention and adversarial training
    Li, Ruirui
    Jiang, Jyun-Yu
    Wu, Xian
    Hsieh, Chu-Cheng
    Stolcke, Andreas
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2020, 2020-October : 2272 - 2276
  • [3] Self-Attention Encoding and Pooling for Speaker Recognition
    Safari, Pooyan
    India, Miquel
    Hernando, Javier
    INTERSPEECH 2020, 2020, : 941 - 945
  • [4] An ensemble of CNNs with self-attention mechanism for DeepFake video detection
    Omar, Karima
    Sakr, Rasha H.
    Alrahmawy, Mohammed F.
    Neural Computing and Applications, 2024, 36 (06) : 2749 - 2765
  • [5] Speaker diarization with variants of self-attention and joint speaker embedding extractor
    Fu, Pengbin
    Ma, Yuchen
    Yang, Huirong
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (05) : 9169 - 9180
  • [8] Speaker-Aware Speech Enhancement with Self-Attention
    Lin, Ju
    Van Wijngaarden, Adriaan J.
    Smith, Melissa C.
    Wang, Kuang-Ching
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 486 - 490
  • [9] LOCAL INFORMATION MODELING WITH SELF-ATTENTION FOR SPEAKER VERIFICATION
    Han, Bing
    Chen, Zhengyang
    Qian, Yanmin
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6727 - 6731