Cross-lingual speech emotion recognition via triple attentive asymmetric convolutional neural network

Cited by: 23
Authors
Ocquaye, Elias N. N. [1 ]
Mao, Qirong [1 ]
Xue, Yanfei [1 ]
Song, Heping [1 ]
Affiliations
[1] Jiangsu Univ, Dept Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
Keywords
center loss; cross-lingual; domain adaptation; speech emotion recognition; triple attentive asymmetric; features
DOI
10.1002/int.22291
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Cross-corpus speech emotion recognition (SER) via domain adaptation methods has gained wide acknowledgment as a way to build robust emotion recognition systems from different corpora or datasets. However, the cross-lingual setting remains a challenge in SER and needs further attention to handle the scenario in which training and testing involve different languages. In this paper, we propose a triple attentive asymmetric convolutional neural network to address cross-lingual and cross-corpus speech emotion recognition in an unsupervised manner. The proposed method adopts the joint supervision of softmax loss and center loss to learn highly discriminative feature representations for the target domain through the use of high-quality pseudo-labels. The model uses three attentive convolutional neural networks asymmetrically: two of the networks, trained on labeled source samples, artificially label the unlabeled target samples, and the third network learns salient, discriminative target features from these pseudo-labeled samples. We evaluate the proposed method on datasets in three languages (English, German, and Italian). The experimental results indicate that our method achieves higher prediction accuracy than other state-of-the-art methods.
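To illustrate the joint supervision described in the abstract, the following is a minimal PyTorch sketch of combining a softmax (cross-entropy) loss with a center loss over pseudo-labeled target features. The class names, dimensions, and weighting factor are illustrative assumptions and do not reproduce the authors' implementation or the triple-network architecture itself.

```python
# Hypothetical sketch: joint softmax + center loss supervision (PyTorch).
# All names and hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn


class CenterLoss(nn.Module):
    """Penalizes the squared distance between features and their class centers."""

    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # One learnable center per emotion class.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # Select the center assigned to each sample's (pseudo-)label.
        batch_centers = self.centers[labels]                 # (B, feat_dim)
        return ((features - batch_centers) ** 2).sum(dim=1).mean()


# Assumed setup: 4 emotion classes, 256-d CNN embeddings, weight lambda = 0.5.
num_classes, feat_dim, lam = 4, 256, 0.5
ce_loss = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes, feat_dim)

# Stand-ins for CNN feature embeddings, classifier logits, and pseudo-labels
# produced by the two labeling networks (random here for demonstration).
features = torch.randn(8, feat_dim, requires_grad=True)
logits = torch.randn(8, num_classes, requires_grad=True)
pseudo_labels = torch.randint(0, num_classes, (8,))

# Joint objective: cross-entropy for class separability plus center loss
# for intra-class compactness of the target-domain features.
loss = ce_loss(logits, pseudo_labels) + lam * center_loss(features, pseudo_labels)
loss.backward()
```

In this kind of scheme, the cross-entropy term keeps classes separable while the center-loss term pulls pseudo-labeled target features toward their class centers, which is one common way to obtain more discriminative target-domain representations.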
Pages: 53-71
Number of pages: 19