Deep learning-based late fusion of multimodal information for emotion classification of music video

Cited by: 124
Authors
Pandeya, Yagya Raj [1 ]
Lee, Joonwhoan [1 ]
Affiliations
[1] Jeonbuk Natl Univ, Div Comp Sci & Engn, Jeonju, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Emotion classification; Music video dataset; CNN; Multimodal approach; Late fusion; NEURAL-NETWORKS; REPRESENTATION; MODEL; RECOGNITION; CATEGORIES; SPACE;
DOI
10.1007/s11042-020-08836-3
Chinese Library Classification code
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Affective computing is an emerging research area that aims to enable intelligent systems to recognize, feel, infer, and interpret human emotions. Music videos, widely distributed both online and offline, are a rich source for human emotion analysis because they integrate the composer's internal feelings through song lyrics, musical instrument performance, and visual expression. In general, the metadata that music video customers use to choose a product includes high-level semantics such as emotion, so automatic emotion analysis may be necessary. In this research area, however, the lack of labeled datasets is a major problem. Therefore, we first construct a balanced music video emotion dataset with diversity of territory, language, culture, and musical instruments. We test this dataset on four unimodal and four multimodal convolutional neural networks (CNNs) for music and video. First, we separately fine-tune each pre-trained unimodal CNN and test its performance on unseen data. In addition, we train a 1-dimensional CNN-based music emotion classifier on raw waveform input. A comparative analysis of each unimodal classifier over various optimizers is made to find the best model to integrate into a multimodal structure. The best unimodal modality is integrated with the corresponding music and video network features for the multimodal classifier. The multimodal structure integrates all music video features and makes the final classification with a softmax classifier using a late feature fusion strategy. All possible multimodal structures are also combined into one predictive model to obtain the overall prediction. All the proposed multimodal structures use cross-validation at the decision level to mitigate the data scarcity problem (overfitting). Evaluation results using various metrics show a boost in the performance of the multimodal architectures compared to each unimodal emotion classifier.
The predictive model that integrates all multimodal structures achieves 88.56% accuracy, an F1-score of 0.88, and an area under the curve (AUC) of 0.987. These results suggest that high-level human emotions are classified well by the proposed CNN-based multimodal networks, even though only a small amount of labeled data is available for training.
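The late feature fusion strategy the abstract describes can be illustrated with a minimal sketch: each unimodal CNN produces a feature vector, the vectors are concatenated, and a final softmax layer produces class probabilities. The feature dimensions, number of emotion classes, and random weights below are illustrative placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_predict(music_feat, video_feat, W, b):
    """Concatenate modality features, then classify with one softmax layer."""
    fused = np.concatenate([music_feat, video_feat], axis=-1)
    return softmax(fused @ W + b)

n_classes = 6                            # hypothetical emotion-class count
music_feat = rng.standard_normal(128)    # placeholder music-CNN features
video_feat = rng.standard_normal(256)    # placeholder video-CNN features
W = rng.standard_normal((128 + 256, n_classes)) * 0.01
b = np.zeros(n_classes)

probs = late_fusion_predict(music_feat, video_feat, W, b)
print(probs.shape)          # (6,) — one probability per emotion class
```

In practice the fusion weights would be trained jointly on the concatenated features, and the per-structure predictions would then be combined into the overall predictive model.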
Pages: 2887-2905
Page count: 19