Improvement of Accent Classification Models Through Grad-Transfer From Spectrograms and Gradient-Weighted Class Activation Mapping

Times cited: 4
Authors
Carofilis, Andres [1]
Alegre, Enrique [1]
Fidalgo, Eduardo [1]
Fernandez-Robles, Laura [2]
Affiliations
[1] Univ Leon, Dept Elect Elect & Syst Engn, Leon 24071, Spain
[2] Univ Leon, Dept Mech Informat & Aerospace Engn, Leon 24071, Spain
Keywords
Accent classification; Grad-CAM; Grad-Transfer; speech processing; language identification; features; long
DOI
10.1109/TASLP.2023.3297961
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Automatic accent classification is an active research field within speech processing. It can help identify a speaker's region of origin, which is useful both in police investigations carried out by Law Enforcement Agencies and for improving current speech recognition systems. This article presents a novel descriptor called Grad-Transfer, extracted with Gradient-weighted Class Activation Mapping (Grad-CAM), an interpretability method for convolutional neural networks (CNNs). Additionally, we propose an accent classification methodology that uses Grad-Transfer to transfer the knowledge acquired by a CNN to a classical machine learning algorithm. The article tests two hypotheses: first, that the coarse localization maps produced by Grad-CAM on spectrograms highlight the spectrogram regions that are important for predicting accents; and second, that Grad-Transfer descriptors computed from audio recordings are distinctive representations of the target accents. Both hypotheses were validated experimentally by clustering the generated Grad-Transfer descriptors according to the original accent of each recording with the Birch and k-means algorithms. In experiments on the Voice Cloning Toolkit dataset, a Gaussian Naive Bayes classifier improved by up to 23.00% in macro average accuracy and 23.58% in unweighted average recall, compared with a model trained on spectrograms. These results show that Grad-Transfer can improve the performance of accent classification models and open the door to new applications in similar tasks.
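The Grad-CAM step the abstract builds on, and the unweighted average recall (UAR) metric it reports, can be sketched in a few lines. This is a minimal sketch assuming NumPy, and assuming the feature maps A^k of the chosen convolutional layer and the gradients of the target accent score with respect to them have already been obtained from the CNN (for example through a deep learning framework's autograd). `grad_cam` follows the standard Grad-CAM definition: global-average-pool the gradients to get one weight per feature map, take the weighted sum of the maps, and apply a ReLU. `grad_transfer_descriptor` is hypothetical: it only min-max normalizes and flattens the map, an illustrative stand-in that is not necessarily how the authors construct Grad-Transfer. `unweighted_average_recall` computes recall per class and averages it with equal class weights.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM coarse localization map for one input and one target class.

    activations: feature maps A^k of a conv layer, shape (K, H, W).
    gradients:   d y^c / d A^k for the target class score y^c, same shape.
    Returns a non-negative (H, W) map over the time-frequency grid.
    """
    if activations.shape != gradients.shape:
        raise ValueError("activations and gradients must have the same shape")
    # alpha_k^c: global average pooling of the gradients over H and W.
    alphas = gradients.mean(axis=(1, 2))             # shape (K,)
    # Weighted sum of the K feature maps.
    cam = np.tensordot(alphas, activations, axes=1)  # shape (H, W)
    # ReLU keeps only regions that increase the class score.
    return np.maximum(cam, 0.0)


def grad_transfer_descriptor(cam: np.ndarray) -> np.ndarray:
    """HYPOTHETICAL descriptor: min-max normalize the map and flatten it.

    Illustrative assumption only; not the paper's exact Grad-Transfer definition.
    """
    cam = cam.astype(float)
    span = cam.max() - cam.min()
    if span > 0:
        cam = (cam - cam.min()) / span
    else:
        cam = np.zeros_like(cam)
    return cam.ravel()


def unweighted_average_recall(y_true, y_pred) -> float:
    """UAR: recall computed per class, then averaged with equal class weights."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((8, 4, 6))      # 8 feature maps on a 4x6 time-frequency grid
    dA = rng.normal(size=A.shape)  # stand-in for gradients from autograd
    cam = grad_cam(A, dA)
    desc = grad_transfer_descriptor(cam)
    print(cam.shape, desc.shape)   # (4, 6) (24,)
    print(round(unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1]), 3))  # 0.833
```

On the imbalanced toy labels in the demo, plain accuracy would be 0.75, while UAR is about 0.833 because the minority class's recall counts as much as the majority class's. In a pipeline like the one described, descriptors of this kind would then be fed to a classical classifier such as Gaussian Naive Bayes.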
Pages: 2859-2871
Number of pages: 13