EmoMV: Affective music-video correspondence learning datasets for classification and retrieval

Cited by: 11
Authors
Thao, Ha Thi Phuong [1 ]
Roig, Gemma [2 ]
Herremans, Dorien [1 ]
Affiliations
[1] Singapore Univ Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
[2] Goethe Univ Frankfurt, Dept Comp Sci, Robert Mayer Str 11-15, D-60323 Frankfurt, Germany
Keywords
Multi-task learning deep neural networks; Affective audio-visual correspondence learning; Emotion-based matching; Affective music-video retrieval; EmoMV dataset collection; Affective computing; EMOTION; ATTENTION
DOI
10.1016/j.inffus.2022.10.002
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Studies in affective audio-visual correspondence learning require ground-truth data to train, validate, and test models. The number of available datasets with benchmarks, however, is still limited. In this paper, we create a collection of three datasets (called EmoMV) for affective correspondence learning between the music and video modalities. The first two, EmoMV-A and EmoMV-B, are constructed from music video segments drawn from other available datasets. The third, EmoMV-C, is created from music videos that we collected ourselves from YouTube. The music-video pairs in our datasets are annotated as matched or mismatched in terms of the emotions they convey. The emotions are annotated by humans in the EmoMV-A dataset, while in EmoMV-B and EmoMV-C they are predicted by a pretrained deep neural network. A user study is carried out to evaluate the accuracy of the "matched" and "mismatched" labels in the EmoMV dataset collection. In addition to the three new datasets, we also propose a benchmark deep neural network model for binary affective music-video correspondence classification, which we then modify for affective music-video retrieval. Extensive experiments are carried out on all three datasets of the EmoMV collection, and the results demonstrate that our proposed model outperforms state-of-the-art approaches on both the binary classification and retrieval tasks. We envision that our newly created dataset collection, together with the proposed benchmark models, will facilitate advances in affective computing research.
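The abstract describes the benchmark model only at a high level. As a purely illustrative sketch (not the authors' architecture), the following PyTorch snippet shows the general shape of a two-branch affective correspondence model: each modality is embedded separately, a binary head classifies a (music, video) pair as emotionally matched or mismatched, and the same embeddings can be reused for retrieval via cosine similarity. All module names, feature dimensions, and the fusion scheme below are assumptions, not details from the paper.

```python
# Illustrative sketch only -- NOT the authors' benchmark model.
# A generic two-branch network for binary affective music-video
# correspondence classification, with embeddings reusable for retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceNet(nn.Module):
    def __init__(self, music_dim=128, video_dim=512, embed_dim=256):
        super().__init__()
        # Hypothetical input dims: precomputed music/video features.
        self.music_enc = nn.Sequential(
            nn.Linear(music_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.video_enc = nn.Sequential(
            nn.Linear(video_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        # Binary matched/mismatched head over the concatenated pair.
        self.classifier = nn.Linear(2 * embed_dim, 1)

    def forward(self, music, video):
        zm = F.normalize(self.music_enc(music), dim=-1)
        zv = F.normalize(self.video_enc(video), dim=-1)
        logit = self.classifier(torch.cat([zm, zv], dim=-1)).squeeze(-1)
        return logit, zm, zv

# Training step on matched (label=1) / mismatched (label=0) pairs.
model = CorrespondenceNet()
music = torch.randn(8, 128)   # dummy batch of music features
video = torch.randn(8, 512)   # dummy batch of video features
labels = torch.randint(0, 2, (8,)).float()
logit, zm, zv = model(music, video)
loss = F.binary_cross_entropy_with_logits(logit, labels)
loss.backward()

# Retrieval: rank videos for each music query by cosine similarity
# of the normalized embeddings.
with torch.no_grad():
    sims = zm @ zv.t()                            # (8, 8) similarities
    ranking = sims.argsort(dim=-1, descending=True)
```

In this kind of design, the classification head and the retrieval ranking share the same modality embeddings, which is one plausible way a classification benchmark could be "modified to adapt to" retrieval as the abstract describes.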
Pages: 64-79
Page count: 16