AGE-VOX-CELEB: MULTI-MODAL CORPUS FOR FACIAL AND SPEECH ESTIMATION

Cited by: 14

Authors:

Tawara, Naohiro [1]

Ogawa, Atsunori [1]

Kitagishi, Yuki [1]

Kamiyama, Hosana [1]

Affiliations:

[1] NTT Corp, Tokyo, Japan

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

Speech and facial age estimation; x-vector; squeeze-and-excitation network; cross-modal learning;

DOI:

10.1109/ICASSP39728.2021.9414272

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Estimating a speaker's age from speech is more challenging than estimating it from the face because few public corpora are available. To tackle this problem, we construct a new audio-visual age corpus named AgeVoxCeleb by annotating VoxCeleb2 videos with age labels. AgeVoxCeleb is the first large-scale, balanced, and multi-modal age corpus that contains both video and speech of the same speakers from a wide age range. Using AgeVoxCeleb, our paper makes the following contributions: (i) By comparing state-of-the-art models for each task, we show that a facial age estimation model can outperform a speech age estimation model. (ii) We show that facial age estimation is more robust to differences between the training and test sets. (iii) We develop cross-modal transfer learning from face to speech age estimation, showing that the age estimated by a facial age estimation model can be used to train a speech age estimation model. The proposed AgeVoxCeleb will be published at https://github.com/nttcslab-sp/agevoxceleb.
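The cross-modal transfer idea in contribution (iii) can be illustrated with a toy sketch: a face model acts as a teacher, its age estimates become pseudo-labels, and a speech model is trained on those pseudo-labels alone. Everything below is an assumption made for illustration. The 1-D features, the synthetic data, the `teacher_face_model` stand-in, and the least-squares student are hypothetical and are not the x-vector or squeeze-and-excitation models named in the keywords.

```python
import random


def fit_linear(xs, ys):
    """Ordinary least squares for y = a * x + b on 1-D inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def teacher_face_model(face_feature):
    # Hypothetical stand-in for a pretrained facial age estimator.
    return 2.0 * face_feature + 10.0


random.seed(0)

# Synthetic speakers: each has a true age plus one noisy face feature and
# one noisy speech feature derived from that age (toy assumption).
true_ages = [random.uniform(20, 70) for _ in range(200)]
face_features = [(age - 10.0) / 2.0 + random.gauss(0, 1.0) for age in true_ages]
speech_features = [0.1 * age + random.gauss(0, 0.3) for age in true_ages]

# Step 1: the face teacher labels every video; no true age is used here.
pseudo_ages = [teacher_face_model(f) for f in face_features]

# Step 2: the speech student is trained on the pseudo-labels only.
a, b = fit_linear(speech_features, pseudo_ages)

# Step 3: compare the student's predictions with the true ages, which were
# never seen during training (same speakers, so this is not a held-out test).
predictions = [a * s + b for s in speech_features]
mae = sum(abs(p - t) for p, t in zip(predictions, true_ages)) / len(true_ages)
print(f"student MAE against true ages: {mae:.2f} years")
```

The point of the sketch is only the data flow: the speech model never sees a ground-truth age, yet it can still learn a usable mapping when the teacher's estimates are reasonably accurate.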

Citation

Pages: 6963-6967

Page count: 5

31 references in total

[1] Afouras T., 2018, arXiv preprint arXiv:1809.02108

[2] Agustsson E, 2017, P FG

[3] Emotion Recognition in Speech using Cross-Modal Transfer in the Wild [J]. Albanie, Samuel; Nagrani, Arsha; Vedaldi, Andrea; Zisserman, Andrew. PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018: 292-301

[4] Age estimation via face images: a survey [J]. Angulu, Raphael; Tapamo, Jules R.; Adewumi, Aderemi O. EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2018

[5] [Anonymous], 2019, P ICASSP

[6] Bahari MH, 2012, 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, P506

[7] VGGFace2: A dataset for recognising faces across pose and age [J]. Cao, Qiong; Shen, Li; Xie, Weidi; Parkhi, Omkar M.; Zisserman, Andrew. PROCEEDINGS 2018 13TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2018), 2018: 67-74

[8] Chen BC, 2014, LECT NOTES COMPUT SC, V8694, P768, DOI 10.1007/978-3-319-10599-4_49

[9] Chung JS, 2018, INTERSPEECH, P1086

[10] Fariza Arna, 2019, 2019 International Electronics Symposium (IES). Proceedings, P607, DOI 10.1109/ELECSYM.2019.8901521
