Bidirectional Convolutional Recurrent Sparse Network (BCRSN): An Efficient Model for Music Emotion Recognition

Cited by: 58
Authors
Dong, Yizhuo [1 ]
Yang, Xinyu [1 ]
Zhao, Xi [1 ]
Li, Juan [2 ]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Peoples R China
[2] Xi An Jiao Tong Univ, Mus Educ Ctr, Xian 710049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Music emotion recognition; bidirectional convolutional recurrent sparse network; sequential-information-included affect-salient features selection; long short-term memory; Lasso regression; FEATURES;
DOI
10.1109/TMM.2019.2918739
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Music emotion recognition, which enables effective and efficient music organization and retrieval, is a challenging subject in the field of music information retrieval. In this paper, we propose a new bidirectional convolutional recurrent sparse network (BCRSN) for music emotion recognition based on convolutional neural networks and recurrent neural networks. Our model adaptively learns the sequential-information-included affect-salient features (SII-ASF) from the 2-D time-frequency representation (i.e., spectrogram) of music audio signals. By combining feature extraction, ASF selection, and emotion prediction, the BCRSN can achieve continuous emotion prediction of audio files. To reduce the high computational complexity caused by the numerical-type ground truth, we propose a weighted hybrid binary representation (WHBR) method that converts the regression prediction process into a weighted combination of multiple binary classification problems. We test our method on two benchmark databases: the Database for Emotional Analysis in Music and MoodSwings Turk. The results show that the WHBR method greatly reduces the training time and improves the prediction accuracy. The extracted SII-ASF is robust to genre, timbre, and noise variation while remaining sensitive to emotion, and it achieves a significant improvement over the best-performing feature sets in MediaEval 2015. Extensive experiments further demonstrate that the proposed method outperforms the state-of-the-art methods.
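The core idea behind WHBR, as described in the abstract, is to replace one regression target with several binary targets whose weighted combination reconstructs the numeric value. The paper's exact encoding is not given in this record, so the sketch below assumes a plain binary-expansion scheme with fixed weights 2^-k; the function names are hypothetical, and in the full model each bit would be predicted by a separate binary classifier.

```python
import numpy as np

def to_weighted_binary(y, n_bits=8):
    """Encode a value y in [0, 1] as n_bits binary digits.

    Each bit k carries a fixed weight 2^-(k+1); greedily setting bits
    from the most significant weight downward gives a standard binary
    expansion of y (truncated to n_bits of precision).
    """
    weights = 0.5 ** np.arange(1, n_bits + 1)
    bits = np.zeros(n_bits, dtype=int)
    remainder = float(y)
    for k in range(n_bits):
        if remainder >= weights[k]:
            bits[k] = 1
            remainder -= weights[k]
    return bits, weights

def from_weighted_binary(bits, weights):
    """Decode: the prediction is the weighted sum of the binary outputs."""
    return float(np.dot(bits, weights))

# Example: 0.625 = 0.5 + 0.125 -> bits [1, 0, 1] with 3-bit precision
bits, weights = to_weighted_binary(0.625, n_bits=3)
```

Under this assumed scheme, training reduces to n_bits independent binary classification problems, which is the complexity reduction the abstract attributes to WHBR; the actual weighting in the paper may differ.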
Pages: 3150-3163
Number of pages: 14