A Multi-Scale Multi-Task Learning Model for Continuous Dimensional Emotion Recognition from Audio

Cited by: 4
Authors
Li, Xia [1 ,2 ]
Lu, Guanming [1 ]
Yan, Jingjie [1 ]
Zhang, Zhengyan [1 ,3 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Telecommun & Informat Engn, Nanjing 210003, Peoples R China
[2] Anhui Univ Technol, Sch Math & Phys, Maanshan 243000, Peoples R China
[3] Jiangsu Univ Sci & Technol, Sch Elect & Informat, Zhenjiang 212003, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
continuous dimensional emotion recognition; multi-task learning; deep belief network;
DOI
10.3390/electronics11030417
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Because of the many advantages of the dimensional emotion model, continuous dimensional emotion recognition from audio has attracted increasing attention in recent years. Features and dimensional emotion labels on different time scales have different characteristics and carry different information. To exploit features and emotion representations from multiple time scales, a novel multi-scale multi-task (MSMT) learning model is proposed in this paper. The MSMT model is built on a deep belief network (DBN) with only one hidden layer; the hidden-layer and linear-layer parameters are shared across all features. Multiple temporal pooling operations are inserted between the hidden layer and the linear layer to capture information at multiple time scales. The mean squared errors (MSE) of the main task and the secondary task are combined to form the final objective function. Extensive experiments were conducted on the RECOLA and SEMAINE datasets to demonstrate the effectiveness of the model. The results on both datasets show that adding a secondary scale, even to the scale with the best single-scale single-task performance, yields significant performance improvements.
Pages: 16
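The abstract describes the model concretely enough to sketch. The following is a minimal, hypothetical PyTorch rendering of the MSMT idea, not the authors' implementation: a shared hidden layer (a plain sigmoid layer standing in for the DBN-pretrained one), temporal mean-pooling at several assumed time scales between the hidden and linear layers, and a main-plus-secondary MSE objective with an assumed weight `alpha`. All sizes, scales, and the pooling type are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSMT(nn.Module):
    """Sketch of the multi-scale multi-task (MSMT) idea from the abstract:
    one shared hidden layer, temporal pooling at several time scales, and
    one shared linear output layer. A plain sigmoid layer stands in for the
    DBN-pretrained hidden layer; sizes and scales are illustrative."""

    def __init__(self, n_features, n_hidden=128, scales=(4, 16)):
        super().__init__()
        self.hidden = nn.Linear(n_features, n_hidden)  # shared across all features
        self.scales = scales                           # pooling window lengths in frames (assumed)
        self.out = nn.Linear(n_hidden, 1)              # shared linear layer

    def forward(self, x):
        # x: (batch, time, n_features); returns one prediction sequence per scale
        h = torch.sigmoid(self.hidden(x))
        preds = []
        for s in self.scales:
            # mean-pool hidden activations over non-overlapping windows of length s
            pooled = F.avg_pool1d(h.transpose(1, 2), kernel_size=s).transpose(1, 2)
            preds.append(self.out(pooled).squeeze(-1))
        return preds

def msmt_loss(preds, targets, alpha=0.5):
    # Final objective: MSE of the main task plus the secondary task's MSE,
    # weighted by an assumed coefficient alpha.
    mse = nn.MSELoss()
    return mse(preds[0], targets[0]) + alpha * mse(preds[1], targets[1])

# Usage with assumed shapes: 64 frames of 88-dim features (eGeMAPS-sized),
# with labels pre-pooled to the two time scales.
model = MSMT(n_features=88)
x = torch.randn(8, 64, 88)
y_main, y_sec = torch.randn(8, 16), torch.randn(8, 4)
loss = msmt_loss(model(x), [y_main, y_sec])
```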