Optimizing Multi-Taper Features for Deep Speaker Verification

Cited by: 0
Authors
Liu, Xuechen [1, 2]
Sahidullah, Md [1]
Kinnunen, Tomi [2]
Affiliations
[1] Univ Lorraine, CNRS, INRIA, LORIA, F-54000 Nancy, France
[2] Univ Eastern Finland, Sch Comp, FI-80101 Joensuu, Finland
Funding
Academy of Finland
Keywords
Feature extraction; Discrete Fourier transforms; Task analysis; Neural networks; Mel frequency cepstral coefficient; Stochastic processes; Standards; Multi-taper spectrum; speaker verification; RECOGNITION; MFCC;
DOI
10.1109/LSP.2021.3122796
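Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract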
Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs). Although past work has reported promising automatic speaker verification (ASV) results with Gaussian mixture model-based classifiers, the performance of multi-taper MFCCs with deep ASV systems remains an open question. Instead of using a static-taper design, we propose to optimize the multi-taper estimator jointly with a deep neural network trained for ASV tasks. Our method achieves a maximum relative improvement of 25.8% in equal error rate over the static-taper design on the SITW corpus, and it helps preserve a balanced level of leakage and variance, providing more robustness.
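
The static multi-taper estimate that the abstract contrasts with the windowed DFT can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of sine tapers, the uniform taper weights, the frame length of 400 samples, the 6 tapers, and the function names `sine_tapers` and `multitaper_power_spectrum` are all assumptions made for illustration. The jointly optimized estimator described in the abstract would instead learn the tapers or weights together with the network.

```python
import numpy as np


def sine_tapers(frame_len, num_tapers):
    """Return an orthonormal set of sine tapers, shape (num_tapers, frame_len).

    Sine tapers are an assumption here; other taper families could be used.
    """
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    scale = np.sqrt(2.0 / (frame_len + 1))
    return scale * np.sin(np.pi * k * n / (frame_len + 1))


def multitaper_power_spectrum(frame, tapers, weights=None):
    """Weighted average of the power spectra of the tapered copies of a frame."""
    num_tapers = tapers.shape[0]
    if weights is None:
        # Uniform weights: an assumed static design, not a learned one.
        weights = np.full(num_tapers, 1.0 / num_tapers)
    # One windowed DFT per taper; broadcasting applies each taper to the frame.
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=-1)) ** 2
    return weights @ spectra


# Hypothetical input: one 400-sample frame of white noise.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
tapers = sine_tapers(400, 6)
spectrum = multitaper_power_spectrum(frame, tapers)
```

Averaging several tapered spectra lowers the variance of the estimate compared with a single window, at the cost of some frequency resolution; the resulting spectrum could then be passed to a mel filterbank in place of the single-window power spectrum.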
Pages: 2187-2191
Page count: 5