Emotion, age, and gender classification in children's speech by humans and machines

Cited by: 36
Authors
Kaya, Heysem [1 ]
Salah, Albert Ali [2 ]
Karpov, Alexey [3, 4]
Frolova, Olga [5 ]
Grigorev, Aleksey [5 ]
Lyakso, Elena [5 ]
Affiliations
[1] Namik Kemal Univ, Dept Comp Engn, Corlu, Tekirdag, Turkey
[2] Bogazici Univ, Dept Comp Engn, Istanbul, Turkey
[3] Russian Acad Sci, St Petersburg Inst Informat & Automat, Speech & Multimodal Interfaces Lab, St Petersburg, Russia
[4] ITMO Univ, Dept Speech Informat Syst, St Petersburg, Russia
[5] St Petersburg State Univ, Child Speech Res Grp, St Petersburg, Russia
Funding
Russian Foundation for Basic Research;
Keywords
Emotional child speech; Perception experiments; Spectrographic analysis; Emotional states; Age recognition; Gender recognition; Computational paralinguistics; RECOGNITION;
DOI
10.1016/j.csl.2017.06.002
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
In this article, we present the first emotional child speech corpus in Russian, called "EmoChildRu", collected from children aged 3 to 7 years. The base corpus includes over 20,000 recordings (approx. 30 h) from 120 children. Audio recordings were carried out in three controlled settings that elicit different emotional states in the children: playing with a standard set of toys; repetition of words from a toy-parrot in a game store setting; and watching a cartoon and retelling the story. The corpus is designed to study how emotional state is reflected in the characteristics of voice and speech, and to support studies of the formation of emotional states in ontogenesis. A portion of the corpus is annotated for three emotional states (comfort, discomfort, neutral). Additional data include the results of adult listeners' analysis of child speech, questionnaires, and annotations for gender and age in months. We also provide several baselines comparing human and machine performance on this corpus for the prediction of age, gender, and comfort state. While acoustics-based automatic systems show higher performance in age estimation, they do not reach human perception levels in comfort-state and gender classification. These comparative results indicate the importance and necessity of developing further linguistic models for discrimination. (C) 2017 Elsevier Ltd. All rights reserved.
Pages: 268-283 (16 pages)