Music genre classification based on res-gated CNN and attention mechanism

Cited by: 5

Authors:

Xie, Changjiang [1]

Song, Huazhu [1,2]

Zhu, Hao [1]

Mi, Kaituo [2]

Li, Zhouhan [1]

Zhang, Yi [1]

Cheng, Jiawen [1]

Zhou, Honglin [1]

Li, Renjie [1]

Cai, Haofeng [1]

Affiliations:

[1] Wuhan Univ Technol, Wuhan, Peoples R China

[2] Anngeen Technol Co Ltd, Wuhan, Peoples R China

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2024 / Volume 83 / Issue 05

Keywords:

Music genre classification; CNN; Transformer;

DOI:

10.1007/s11042-023-15277-1

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The amount of digital music available on the internet has grown significantly with the rapid development of digital multimedia technology. Managing these massive music resources is a thorny problem for large music media platforms, and music genre classification plays an important role in it: a good genre classifier is indispensable for research on and application of music resources, including efficient organization, retrieval, and recommendation. Because convolutional networks are powerful feature extractors, more and more researchers are devoting their efforts to music genre classification models based on convolutional neural networks (CNNs). However, many models do not take the characteristics of musical signals into account when designing the convolutional structure, which leaves the convolutional part of the model overly simple and weakens its local feature extraction. To address this problem, we propose a model that uses a 1D res-gated CNN, rather than a traditional CNN architecture, to extract local information from audio sequences. To aggregate the global information of audio feature sequences, we also apply the Transformer to the music genre classification model and modify the Transformer's decoder structure to suit the task. The experiments use the benchmark datasets GTZAN and Extended Ballroom. Comparative experiments show that our model outperforms most previous approaches and improves the performance of music genre classification.
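The abstract names a 1D res-gated CNN for local feature extraction but does not give its formulation. As a minimal sketch only, the block below assumes one common reading of "residual gated convolution": a feature convolution multiplied elementwise by a sigmoid gate convolution, with the input added back as a skip connection. The function names, tensor shapes, kernel size, and random weights are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def conv1d_same(x, w, b):
    """Zero-padded 'same' 1D convolution over the time axis.

    x: (T, C_in) feature sequence, w: (K, C_in, C_out) with odd K, b: (C_out,).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # pad the time axis only
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]  # (K, C_in) slice centred on frame t
        out[t] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def res_gated_conv1d(x, w_feat, b_feat, w_gate, b_gate):
    """Assumed form: y = x + conv_feat(x) * sigmoid(conv_gate(x)).

    Requires C_in == C_out so the residual addition is well defined.
    """
    feat = conv1d_same(x, w_feat, b_feat)
    gate = sigmoid(conv1d_same(x, w_gate, b_gate))
    return x + feat * gate


# Hypothetical input: 128 frames of 16-dimensional audio features.
rng = np.random.default_rng(0)
frames, channels, kernel = 128, 16, 3
x = rng.standard_normal((frames, channels))
w_feat = rng.standard_normal((kernel, channels, channels)) * 0.1
w_gate = rng.standard_normal((kernel, channels, channels)) * 0.1
b_feat = np.zeros(channels)
b_gate = np.zeros(channels)

y = res_gated_conv1d(x, w_feat, b_feat, w_gate, b_gate)
print(y.shape)
```

Under this assumed form, the gate decides per frame and per channel how much of the convolved feature is added to the input, so a block with zero feature weights reduces to the identity. In the model described by the abstract, the output of such blocks would then be passed to the Transformer for global aggregation; that stage is not sketched here.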


Pages: 13527-13542

Page count: 16
