Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Cited by: 176
Authors
Albanie, Samuel [1 ]
Nagrani, Arsha [1 ]
Vedaldi, Andrea [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Cross-modal transfer; speech emotion recognition; face-like stimuli; facial expression; perception; voice
DOI
10.1145/3240508.3240578
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available(1).
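A minimal sketch of the cross-modal distillation idea from the abstract, in PyTorch: a frozen face teacher supplies soft emotion labels for video clips, and a student network over log-mel spectrograms of the same clips is trained to match those labels via temperature-scaled KL divergence, so no labelled audio is ever used. The architecture, class count (8, FERPlus-style), temperature, and tensor shapes here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8  # assumption: FERPlus-style emotion classes

class SpeechStudent(nn.Module):
    """Toy CNN over log-mel spectrograms, shape (batch, 1, mels, frames)."""
    def __init__(self, num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> fixed-size embedding
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        h = self.features(spec).flatten(1)  # the learned speech embedding
        return self.classifier(h)

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Soft-label KL loss at temperature T (Hinton-style distillation)."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# One hypothetical training step. In practice `teacher_logits` would come
# from a frozen facial-emotion network applied to face tracks of the same
# video clips; random tensors stand in for both modalities here.
student = SpeechStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

spectrograms = torch.randn(16, 1, 64, 128)      # stand-in audio batch
teacher_logits = torch.randn(16, NUM_EMOTIONS)  # stand-in teacher outputs

loss = distillation_loss(student(spectrograms), teacher_logits)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After training, the pooled `features` output can serve as the kind of speech emotion embedding the abstract describes, independent of the classifier head.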
Pages: 292 - 301
Page count: 10