BSML: Bidirectional Sampling Aggregation-based Metric Learning for Low-resource Uyghur Few-shot Speaker Verification

Times Cited: 1
Authors
Zi, Yunfei [1 ]
Xiong, Shengwu [1 ]
Affiliation
[1] Wuhan Univ Technol, Sch Comp & Artificial Intelligence, 122 Luoshi Rd, Wuhan 430070, Hubei, Peoples R China
Keywords
Uyghur; bidirectional sampling; metric learning; few-shot; speaker verification; low-resource language; limited data; recognition
DOI
10.1145/3564782
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Text-independent speaker verification remains an active research topic, particularly when enrollment and/or test data are limited. In low-resource, few-shot settings, the lack of sufficient training data makes models prone to overfitting and low recognition accuracy. To address this, a bidirectional sampling aggregation-based meta-metric learning method, termed bidirectional sampling multi-scale Fisher feature fusion (BSML), is proposed for speaker recognition in low-resource environments with limited data. First, BSML performs feature enhancement in the feature extraction stage; second, a large number of similar but disjoint tasks are used to train the model to compare sample similarity; finally, new tasks identify unknown samples by computing their similarity to known samples. Extensive experiments are conducted on a short-duration, text-independent speaker verification dataset derived from the low-resource Uyghur THUYG-20 corpus, comprising speech samples of diverse lengths. The results show that the metric learning approach is effective in avoiding overfitting and improving generalization, yielding significant gains for short-duration speaker verification in low-resource Uyghur under few-shot conditions. BSML also outperforms state-of-the-art deep-embedding speaker recognition architectures and recent metric learning approaches by at least 18%-67% on the few-shot test set. Ablation experiments further show that the proposed approach achieves substantial improvement over prior methods, with better performance and generalization ability.
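To make the episodic protocol described in the abstract concrete, the listing below is a minimal PyTorch sketch of generic N-way K-shot metric-learning training: a model is trained on many small support/query tasks to compare sample similarity, and an unknown sample is then scored by its similarity to enrolled embeddings. This is an assumption-based illustration of the general technique only; the encoder, the similarity scale, the episode shapes, and all names are placeholders and do not reproduce the paper's bidirectional sampling multi-scale Fisher feature fusion.

# Minimal episodic metric-learning sketch (not the authors' BSML implementation).
# It illustrates the protocol summarized in the abstract: train on many small
# support/query "tasks" so the model learns to compare sample similarity, then
# score an unknown sample by its similarity to enrolled (support) embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """Placeholder encoder; BSML's bidirectional-sampling multi-scale Fisher
    feature fusion would replace this simple convolutional stack."""
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),               # pool over time frames
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, x):                          # x: (batch, feat_dim, frames)
        emb = self.fc(self.net(x).squeeze(-1))
        return F.normalize(emb, dim=-1)            # unit-length speaker embeddings


def episode_loss(encoder, support, query, n_way, k_shot):
    """Prototype-style similarity loss for one N-way K-shot episode.
    support: (n_way * k_shot, feat_dim, frames), grouped by class.
    query:   (n_way * q,      feat_dim, frames), grouped the same way."""
    proto = encoder(support).view(n_way, k_shot, -1).mean(1)   # class prototypes
    proto = F.normalize(proto, dim=-1)
    q_emb = encoder(query)
    logits = q_emb @ proto.t() * 10.0                          # scaled cosine similarity
    labels = torch.arange(n_way).repeat_interleave(q_emb.size(0) // n_way)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    enc = SpeakerEncoder()
    opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
    n_way, k_shot, q = 5, 3, 2
    for step in range(3):                          # toy episodes with random features
        support = torch.randn(n_way * k_shot, 40, 200)
        query = torch.randn(n_way * q, 40, 200)
        loss = episode_loss(enc, support, query, n_way, k_shot)
        opt.zero_grad(); loss.backward(); opt.step()
        print(f"episode {step}: loss={loss.item():.3f}")

In a verification setting, the same trained encoder would embed an enrollment utterance and a test utterance, and a threshold on their cosine similarity would accept or reject the claimed identity.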
Pages: 23