Bidirectional Convolutional Recurrent Sparse Network (BCRSN): An Efficient Model for Music Emotion Recognition

Cited by: 58
Authors
Dong, Yizhuo [1 ]
Yang, Xinyu [1 ]
Zhao, Xi [1 ]
Li, Juan [2 ]
Affiliations
[1] Xi'an Jiaotong University, Department of Computer Science and Technology, Xi'an 710049, People's Republic of China
[2] Xi'an Jiaotong University, Music Education Center, Xi'an 710049, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Music emotion recognition; bidirectional convolutional recurrent sparse network; sequential-information-included affect-salient feature selection; long short-term memory; Lasso regression
DOI
10.1109/TMM.2019.2918739
CLC Classification Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Music emotion recognition, which enables effective and efficient music organization and retrieval, is a challenging subject in the field of music information retrieval. In this paper, we propose a new bidirectional convolutional recurrent sparse network (BCRSN) for music emotion recognition, built on convolutional and recurrent neural networks. Our model adaptively learns sequential-information-included affect-salient features (SII-ASF) from the 2-D time-frequency representation (i.e., the spectrogram) of music audio signals. By combining feature extraction, ASF selection, and emotion prediction, the BCRSN achieves continuous emotion prediction for audio files. To reduce the high computational complexity caused by the numerical-type ground truth, we propose a weighted hybrid binary representation (WHBR) method that converts the regression prediction process into a weighted combination of multiple binary classification problems. We evaluate our method on two benchmark databases, the Database for Emotional Analysis in Music and MoodSwings Turk. The results show that the WHBR method greatly reduces training time and improves prediction accuracy. The extracted SII-ASF is robust to variations in genre, timbre, and noise, is sensitive to emotion, and achieves a significant improvement over the best-performing feature sets in MediaEval 2015. Extensive experiments further demonstrate that the proposed method outperforms state-of-the-art methods.
Pages: 3150-3163
Number of pages: 14