Improving Gender Identification in Movie Audio using Cross-Domain Data

Cited by: 9

Authors:

Hebbar, Rajat [1]

Somandepalli, Krishna [1]

Narayanan, Shrikanth [1]

Affiliations:

[1] Univ Southern Calif, Signal Anal & Interpretat Lab, Dept Elect Engn, Los Angeles, CA 90007 USA

Source:

19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES | 2018

Keywords:

gender identification; voice activity detection; deep neural networks; recurrent neural networks; transfer learning; bi-directional long short-term memory; RECOGNITION; SPEECH;

DOI:

10.21437/Interspeech.2018-1462

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Gender identification from audio is an important task for quantitative gender analysis in multimedia and for improving tasks such as speech recognition. Robust gender identification requires speech segmentation that relies on accurate voice activity detection (VAD). These tasks are challenging in movie audio due to diverse and often noisy acoustic conditions. In this work, we acquire VAD labels for movie audio by aligning it with subtitle text, and train a recurrent neural network model for VAD. Subsequently, we apply transfer learning to predict gender using feature embeddings obtained from a model pre-trained for large-scale audio classification. To account for the diverse acoustic conditions in movie audio, we use audio clips from YouTube labeled for gender. We compare the performance of our proposed method with baseline experiments that were set up to assess the importance of the feature embeddings and training data used for the gender identification task. For systematic evaluation, we extend an existing benchmark dataset for movie VAD to include precise gender labels. The VAD system shows results comparable to the state of the art in the movie domain. The proposed gender identification system outperforms existing baselines, achieving an accuracy of 85% for movie audio. We have made the data and related code publicly available.
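The abstract mentions deriving VAD labels by aligning movie audio with subtitle text. This record does not describe the alignment procedure itself, so the following is only a minimal illustrative sketch of the general idea: marking frames as speech when they fall inside a subtitle time interval. The function name `subtitle_intervals_to_vad_labels`, the interval format, and the frame hop are all assumptions made for illustration, not details from the paper.

```python
from typing import List, Tuple


def subtitle_intervals_to_vad_labels(
    intervals: List[Tuple[float, float]],
    duration_s: float,
    frame_hop_s: float = 0.01,
) -> List[int]:
    """Return one 0/1 label per frame (hypothetical helper).

    A frame with index i is labeled 1 (speech) when i lies in
    [round(start / hop), round(end / hop)) for some subtitle interval
    (start, end) given in seconds; all other frames are labeled 0.
    """
    n_frames = int(round(duration_s / frame_hop_s))
    labels = [0] * n_frames
    for start, end in intervals:
        # Clamp the interval to the valid frame range of the clip.
        first = max(0, int(round(start / frame_hop_s)))
        last = min(n_frames, int(round(end / frame_hop_s)))
        for i in range(first, last):
            labels[i] = 1
    return labels


# Example: a 5-second clip with two subtitle intervals and 0.5 s frames.
example_labels = subtitle_intervals_to_vad_labels(
    [(1.0, 2.0), (3.5, 10.0)], duration_s=5.0, frame_hop_s=0.5
)
```

Subtitle timestamps are usually only coarse speech boundaries, so labels produced this way should be treated as noisy training targets rather than ground truth.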


Pages: 282-286

Page count: 5
