AGE-VOX-CELEB: MULTI-MODAL CORPUS FOR FACIAL AND SPEECH ESTIMATION

Cited by: 14
Authors
Tawara, Naohiro [1]
Ogawa, Atsunori [1]
Kitagishi, Yuki [1]
Kamiyama, Hosana [1]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Speech and facial age estimation; x-vector; squeeze-and-excitation network; cross-modal learning;
DOI
10.1109/ICASSP39728.2021.9414272
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Estimating a speaker's age from their speech is more challenging than estimating it from their face because too few public corpora are available. To tackle this problem, we construct a new audio-visual age corpus named AgeVoxCeleb by annotating VoxCeleb2 videos with age labels. AgeVoxCeleb is the first large-scale, balanced, and multi-modal age corpus that contains both video and speech of the same speakers from a wide age range. Using AgeVoxCeleb, our paper makes the following contributions: (i) By comparing the state-of-the-art models in each task, we show that a facial age estimation model can outperform a speech age estimation model. (ii) Facial age estimation is more robust against the difference between training and test sets. (iii) We developed cross-modal transfer learning from face to speech age estimation, showing that the age estimated by a facial age estimation model can be used to train a speech age estimation model. The proposed AgeVoxCeleb will be published at https://github.com/nttcslab-sp/agevoxceleb.
Pages: 6963-6967
Page count: 5