Rethinking Auditory Affective Descriptors Through Zero-Shot Emotion Recognition in Speech

Cited by: 7
Authors
Xu, Xinzhou [1 ,3 ]
Deng, Jun [2 ]
Zhang, Zixing [4 ]
Fan, Xijian [5 ]
Zhao, Li [6 ]
Devillers, Laurence [7 ]
Schuller, Bjoern W. [3 ,4 ]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Sch Internet Things, Nanjing 210003, Peoples R China
[2] Agile Robots AG, D-81477 Munich, Germany
[3] Univ Augsburg, Chair Embedded Intelligence Hlth Care & Wellbeing, D-86159 Augsburg, Germany
[4] Imperial Coll London, Grp Language Audio & Mus GLAM, London SW7 2BX, England
[5] Nanjing Forestry Univ, Coll Informat Sci & Technol, Nanjing 210042, Peoples R China
[6] Southeast Univ, Sch Informat Sci & Engn, Nanjing 210096, Peoples R China
[7] Sorbonne Univ, CNRS LISN, F-75006 Paris, France
Keywords
Prototypes; Annotations; Speech recognition; Emotion recognition; Semantics; Training; Task analysis; Auditory affective descriptors (AADs); semantic-embedding prototypes; speech emotion recognition (SER); zero-shot emotion recognition; attention; framework; machine; audio
DOI
10.1109/TCSS.2021.3130401
Chinese Library Classification
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
Zero-shot speech emotion recognition (SER) endows machines with the ability to sense unseen emotional states in speech, in contrast to conventional SER, which is confined to supervised settings. To address the zero-shot SER task, auditory affective descriptors (AADs) are typically employed to transfer affective knowledge from seen to unseen emotional states. However, it remains unknown which types of AADs describe emotional states in speech well during this transfer. We therefore define and investigate three types of AADs, namely per-emotion semantic-embedding, per-emotion manually annotated, and per-sample manually annotated AADs, through zero-shot emotion recognition in speech. This leads to a systematic design comprising prototype- and annotation-based zero-shot SER modules, which take per-emotion and per-sample AADs as input, respectively. We then perform extensive experimental comparisons between human- and machine-derived AADs on the French emotional speech corpus CINEMO for positive-negative (PN) and within-negative (WN) tasks. The experimental results indicate that semantic-embedding prototypes obtained from pretrained models can outperform manually annotated emotional dimensions in zero-shot SER. They further suggest that, given sufficiently strong pretrained models, machines can understand and describe affective information in speech better than human annotators.
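As a rough illustration of the prototype-based route the abstract describes, the Python sketch below classifies an utterance by projecting its acoustic features into a semantic space and picking the nearest per-emotion prototype. Every name and dimension here is an assumption made for illustration: the paper's actual features, pretrained embedding model, and compatibility mapping are not given in this record, so random vectors stand in for pretrained word embeddings and a trained projection.

# Minimal sketch of prototype-based zero-shot classification, the general
# technique the abstract names. All quantities below are illustrative
# assumptions, not the paper's actual configuration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-emotion semantic-embedding prototypes: one vector per
# emotion label, normally taken from a pretrained word-embedding model.
EMB_DIM = 300
prototypes = {
    "anger": rng.normal(size=EMB_DIM),
    "fear": rng.normal(size=EMB_DIM),
    "sadness": rng.normal(size=EMB_DIM),  # assume "sadness" is unseen in training
}

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)

def project_audio(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map acoustic features into the semantic space via a (hypothetical)
    linear compatibility mapping W, learned on seen emotions only."""
    return l2_normalize(W @ features)

def zero_shot_classify(features: np.ndarray, W: np.ndarray) -> str:
    """Return the emotion whose semantic prototype has the highest cosine
    similarity to the projected acoustic embedding; unseen emotions take
    part purely through their prototypes."""
    z = project_audio(features, W)
    scores = {name: float(z @ l2_normalize(p)) for name, p in prototypes.items()}
    return max(scores, key=scores.get)

# Toy usage: random stand-ins for an utterance's acoustic feature vector
# and a trained projection matrix.
FEAT_DIM = 88  # e.g., the size of an eGeMAPS-style functional feature set
W = rng.normal(size=(EMB_DIM, FEAT_DIM))
utterance = rng.normal(size=FEAT_DIM)
print(zero_shot_classify(utterance, W))

Under this formulation an unseen emotion needs no training audio at all; only its semantic prototype must exist, which is what allows a pretrained word-embedding model to substitute for human dimensional annotations in the comparison the abstract reports.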
Pages: 1530-1541
Page count: 12