Memory Storable Network Based Feature Aggregation for Speaker Representation Learning

Cited by: 5
Authors
Gu, Bin [1]
Guo, Wu [1]
Zhang, Jie [1]
Affiliations
[1] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speaker verification; speaker representation learning; feature aggregation; multi-level information; attention mask; adaptive bias; RECOGNITION; MACHINES;
DOI
10.1109/TASLP.2022.3231709
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
Learning fixed-dimensional speaker representation using deep neural networks is a key step in speaker verification. In this work, we propose an auxiliary memory storable network (MSN) to assist a backbone network for learning discriminative features, which are sequentially aggregated from lower to deeper layers of the backbone. The proposed MSN has a similar architecture to the ResNet and contains a set of cascaded feature aggregation (FA) blocks. Each FA block first aggregates the multi-level features from the previous block and the features from the corresponding backbone layer. The output features of each intermediate layer within the backbone are then refined by the multi-level features of the corresponding FA block through masking and biasing operations. Finally, the features from the last layers of both MSN and the backbone are concatenated to form more discriminative speaker representations. Experimental results on five public datasets show significant and consistent improvements over conventional approaches. The effectiveness of the proposed method is also validated using ablation studies, showing a robust generalization capacity in combination with different backbone networks.
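The data flow described in the abstract (aggregate, then mask and bias, then concatenate) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so every function below (`fa_block`, `refine`, `backbone_layer`, `forward`), the tanh/sigmoid choices, and the `bias_scale` parameter are assumptions made purely for illustration, and plain element-wise arithmetic stands in for the learned convolutional layers a real MSN would use.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def backbone_layer(x, scale):
    # Hypothetical stand-in for one backbone layer: a scaled ReLU.
    return [max(0.0, scale * v) for v in x]


def fa_block(memory, backbone_feat):
    # Hypothetical feature aggregation (FA) block: combine the multi-level
    # features from the previous block with the current backbone features.
    return [math.tanh(m + h) for m, h in zip(memory, backbone_feat)]


def refine(backbone_feat, memory, bias_scale=0.1):
    # Refine backbone features with the aggregated memory features:
    # a sigmoid mask (masking) plus an additive term (biasing).
    mask = [sigmoid(m) for m in memory]
    bias = [bias_scale * m for m in memory]
    return [g * h + b for g, h, b in zip(mask, backbone_feat, bias)]


def forward(x, layer_scales):
    memory = [0.0] * len(x)  # empty memory before the first layer
    h = x
    for scale in layer_scales:
        h = backbone_layer(h, scale)  # backbone computes its features
        memory = fa_block(memory, h)  # the auxiliary network aggregates them
        h = refine(h, memory)         # the backbone output is masked and biased
    # Concatenate the last-layer features of both networks.
    return memory + h


embedding = forward([0.5, -1.0, 2.0], [1.0, 0.5])
print(len(embedding))
```

In this sketch the returned vector is twice the input width, because the final memory features and the final refined backbone features are concatenated.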
Pages: 643-655
Page count: 13
Related papers
50 in total
  • [1] LEARNING SPEAKER REPRESENTATION FOR NEURAL NETWORK BASED MULTICHANNEL SPEAKER EXTRACTION
    Zmolikova, Katerina
    Delcroix, Marc
    Kinoshita, Keisuke
    Higuchi, Takuya
    Ogawa, Atsunori
    Nakatani, Tomohiro
    2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 8 - 15
  • [2] Transport-Oriented Feature Aggregation for Speaker Embedding Learning
    Tian, Yusheng
    Li, Jingyu
    Lee, Tan
    INTERSPEECH 2022, 2022, : 316 - 320
  • [3] Network Representation Based on the Joint Learning of Three Feature Views
    Ye, Zhonglin
    Zhao, Haixing
    Zhang, Ke
    Wang, Zhaoyang
    Zhu, Yu
    BIG DATA MINING AND ANALYTICS, 2019, 2 (04): : 248 - 260
  • [4] Network Representation Based on the Joint Learning of Three Feature Views
    Zhonglin Ye
    Haixing Zhao
    Ke Zhang
    Zhaoyang Wang
    Yu Zhu
    Big Data Mining and Analytics, 2019, 2 (04) : 248 - 260
  • [5] Heterogeneous network representation learning based on role feature extraction
    Sun, Yueheng
    Jia, Mengyu
    Liu, Chang
    Shao, Minglai
    PATTERN RECOGNITION, 2023, 144
  • [6] A robust feature based on sparse representation for speaker recognition
    Xie, Yining
    Huang, Jinjie
    Wang, Xinlei
    Journal of Computational Information Systems, 2013, 9 (09): : 3553 - 3561
  • [7] Feature Hashing for Network Representation Learning
    Wang, Qixiang
    Wang, Shanfeng
    Gong, Maoguo
    Wu, Yue
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 2812 - 2818
  • [8] Improving Time Delay Neural Network Based Speaker Recognition With Convolutional Block And Feature Aggregation Methods
    Zhang, Yu-Jia
    Wang, Yih-Wen
    Chen, Chia-Ping
    Lu, Chung-Li
    Chan, Bo-Cheng
    INTERSPEECH 2021, 2021, : 76 - 80
  • [9] ICA-BASED LIP FEATURE REPRESENTATION FOR SPEAKER AUTHENTICATION
    Wang, S. L.
    Liew, A. W. C.
    SITIS 2007: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SIGNAL IMAGE TECHNOLOGIES & INTERNET BASED SYSTEMS, 2008, : 763 - +
  • [10] Inverse Feature Learning: Feature Learning Based on Representation Learning of Error
    Ghazanfari, Behzad
    Afghah, Fatemeh
    Hajiaghayi, Mohammadtaghi
    IEEE ACCESS, 2020, 8 (08): : 132937 - 132949