Improvement of Accent Classification Models Through Grad-Transfer From Spectrograms and Gradient-Weighted Class Activation Mapping

Cited by: 4

Authors:

Carofilis, Andres [1]

Alegre, Enrique [1]

Fidalgo, Eduardo [1]

Fernandez-Robles, Laura [2]

Affiliations:

[1] Univ Leon, Dept Elect Elect & Syst Engn, Leon 24071, Spain

[2] Univ Leon, Dept Mech Informat & Aeroespace Engn, Leon 24071, Spain

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Accent classification; Grad-CAM; Grad-Transfer; speech processing; LANGUAGE IDENTIFICATION; FEATURES; LONG;

DOI:

10.1109/TASLP.2023.3297961

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Automatic accent classification is an active research field within speech processing. It can help identify a speaker's region of origin, which is useful in police investigations carried out by Law Enforcement Agencies and for improving current speech recognition systems. This article presents a novel descriptor called Grad-Transfer, extracted using the Gradient-weighted Class Activation Mapping (Grad-CAM) method, which is based on convolutional neural network (CNN) interpretability. Additionally, we propose a methodology for accent classification that implements Grad-Transfer, which is based on transferring the knowledge acquired by a CNN to a classical machine learning algorithm. The article works on two hypotheses: that the coarse localization maps produced by Grad-CAM on spectrograms highlight the regions of the spectrograms that are important for predicting accents, and that Grad-Transfer descriptors computed from audio recordings provide distinctive descriptions of the target accents. These hypotheses were demonstrated experimentally by clustering the generated Grad-Transfer descriptors according to the original accent of the recordings using the Birch and $k$-means algorithms. We carried out experiments on the Voice Cloning Toolkit dataset and observed increases in macro-average accuracy and unweighted average recall of up to 23.00% and 23.58%, respectively, for a Gaussian Naive Bayes classifier compared to a model trained with spectrograms. This demonstrates that Grad-Transfer can improve the performance of accent classification models and opens the door to new implementations in similar tasks.
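
As a rough illustration of two ideas named in the abstract, the sketch below computes a coarse Grad-CAM localization map from the feature maps of one convolutional layer and their gradients (following the standard Grad-CAM definition: channel weights are the spatially averaged gradients, and the map is the ReLU of the weighted sum), and computes unweighted average recall as the mean of per-class recalls. The function names, the toy numbers, and the flattening of the map into a descriptor vector are assumptions made for illustration only; the abstract does not specify how the authors build the Grad-Transfer descriptor.

```python
from typing import List

Matrix = List[List[float]]


def grad_cam_map(activations: List[Matrix], gradients: List[Matrix]) -> Matrix:
    """Coarse Grad-CAM localization map for one convolutional layer.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]: gradient of the target class score with respect
    to that activation.
    """
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weight alpha_k: global average of the gradients of map k.
    alphas = [sum(sum(row) for row in g) / (rows * cols) for g in gradients]
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    return [
        [
            max(0.0, sum(a * fmap[i][j] for a, fmap in zip(alphas, activations)))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def unweighted_average_recall(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-class recalls, so each accent class counts equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(classes)


# Toy inputs (made up): two 2x2 feature maps and their gradients.
acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
cam = grad_cam_map(acts, grads)

# Assumption: flatten the map into a vector that a classical classifier
# could consume. The paper's actual descriptor may be built differently.
descriptor = [value for row in cam for value in row]
```

A vector like `descriptor` could then be passed to a classical model such as Gaussian Naive Bayes, which is the kind of knowledge transfer the abstract describes at a high level.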

Pages: 2859-2871

Page count: 13
