On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition

Cited by: 3
Authors
Mukhamediya, Azamat [1 ]
Fazli, Siamac [2 ]
Zollanvari, Amin [1 ]
Affiliations
[1] Nazarbayev Univ, Sch Engn & Digital Sci, Dept Elect & Comp Engn, Astana 010000, Kazakhstan
[2] Nazarbayev Univ, Sch Engn & Digital Sci, Dept Comp Sci, Astana 010000, Kazakhstan
Keywords
Log-Mel spectrogram; speech emotion recognition; SqueezeNet; neural networks
DOI
10.1109/ACCESS.2023.3287093
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812
Abstract
Speech emotion recognition (SER) has become a major area of investigation in human-computer interaction. Conventionally, SER is formulated as a classification problem that follows a common methodology: (i) extracting features from speech signals; and (ii) constructing an emotion classifier using the extracted features. With the advent of deep learning, however, the former stage has been folded into the latter: deep neural networks (DNNs) trained on log-Mel spectrograms (LMS) of audio waveforms extract discriminative features directly from the LMS. A critical and often overlooked issue is that this procedure is carried out without relating the choice of LMS parameters to the performance of the trained DNN classifiers. It is commonplace in SER studies for practitioners to assume "usual" values for these parameters and to devote their main effort to training and comparing various DNN architectures. In contrast with this common approach, in this work we choose a single lightweight pre-trained architecture, namely SqueezeNet, and shift our main effort to tuning the LMS parameters. Our empirical results on three publicly available SER datasets show that: (i) the LMS parameters can considerably affect the performance of DNNs; and (ii) tuning the LMS parameters yields highly competitive classification performance. In particular, treating the LMS parameters as hyperparameters and tuning them led to improvements of roughly 23%, 10%, and 11% over the "usual" LMS parameter values on the EmoDB, IEMOCAP, and SAVEE datasets, respectively.
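Illustrative sketch (not from the paper): the LMS parameters the abstract refers to, such as the FFT/window length, hop length, and number of Mel bands, are the kind of values that can be exposed as hyperparameters. The snippet below assumes the librosa library and uses hypothetical parameter values to show how a log-Mel spectrogram would be computed before being resized and fed to a network such as SqueezeNet; it is not the authors' implementation.

import numpy as np
import librosa

def log_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=256,
                        n_mels=128, fmin=0.0, fmax=8000.0):
    """Compute a log-Mel spectrogram (LMS) for one utterance.

    n_fft, hop_length, n_mels, fmin, and fmax are the LMS parameters that the
    paper treats as tunable hyperparameters; the defaults here are illustrative
    "usual" values, not the authors' settings.
    """
    y, sr = librosa.load(wav_path, sr=sr)            # load and resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, fmin=fmin, fmax=fmax, power=2.0)
    return librosa.power_to_db(mel, ref=np.max)      # shape: (n_mels, n_frames)

# Hypothetical hyperparameter grid over which the LMS settings could be tuned,
# e.g., by validation accuracy of a SqueezeNet classifier trained on the
# resulting spectrograms; the specific values are assumptions, not the paper's.
param_grid = {
    "n_fft":      [512, 1024, 2048],
    "hop_length": [128, 256, 512],
    "n_mels":     [64, 96, 128],
}

In such a setup, each parameter combination would produce a different time-frequency resolution and input size, which is why the choice can materially affect downstream classifier performance.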
Pages: 61950-61957
Number of pages: 8
References
40 in total
  • [1] Amodei D., 2016, P INT C MACH LEARN, P173
  • [2] Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files
    Andayani, Felicia
    Theng, Lau Bee
    Tsun, Mark Teekit
    Chua, Caslon
    [J]. IEEE ACCESS, 2022, 10 : 36018 - 36027
  • [3] Ashok Kumar J., 2020, MACHINE LEARNING MET, P234
  • [4] Audiovisual emotion recognition in wild
    Avots, Egils
    Sapinski, Tomasz
    Bachmann, Maie
    Kaminska, Dorota
    [J]. MACHINE VISION AND APPLICATIONS, 2019, 30 (05) : 975 - 985
  • [5] Towards real-time speech emotion recognition for affective e-learning
    Bahreini K.
    Nadolski R.
    Westera W.
    [J]. Education and Information Technologies, 2016, 21 (5) : 1367 - 1386
  • [6] Berlin TU., 2005, INTERSPEECH, V5, P1517, DOI 10.21437/INTERSPEECH.2005-446
  • [7] Call Redistribution for a Call Center Based on Speech Emotion Recognition
    Bojanic, Milana
    Delic, Vlado
    Karpov, Alexey
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (13)
  • [8] IEMOCAP: interactive emotional dyadic motion capture database
    Busso, Carlos
    Bulut, Murtaza
    Lee, Chi-Chun
    Kazemzadeh, Abe
    Mower, Emily
    Kim, Samuel
    Chang, Jeannette N.
    Lee, Sungbok
    Narayanan, Shrikanth S.
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2008, 42 (04) : 335 - 359
  • [9] Deep Learning With Edge Computing: A Review
    Chen, Jiasi
    Ran, Xukan
    [J]. PROCEEDINGS OF THE IEEE, 2019, 107 (08) : 1655 - 1674
  • [10] 3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition
    Chen, Mingyi
    He, Xuanji
    Yang, Jing
    Zhang, Han
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2018, 25 (10) : 1440 - 1444