A Hybrid Parallel Computing Architecture Based on CNN and Transformer for Music Genre Classification

Cited by: 1
Authors
Chen, Jiyang [1 ,2 ]
Ma, Xiaohong [2 ]
Li, Shikuan [2 ]
Ma, Sile [1 ]
Zhang, Zhizheng [1 ]
Ma, Xiaojing [1 ]
Affiliations
[1] Shandong Univ, Inst Marine Sci & Technol, Qingdao 266237, Peoples R China
[2] Shandong Zhengzhong Informat Technol Co Ltd, Jinan 250098, Peoples R China
Keywords
music genre classification; convolutional neural networks; Transformer encoder; mel spectrogram; NEURAL-NETWORK;
DOI
10.3390/electronics13163313
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Music genre classification (MGC) underpins the efficient organization, retrieval, and recommendation of music resources, so it has substantial research value. Convolutional neural networks (CNNs) have been widely used in MGC and have achieved excellent results. However, because of their local receptive fields, CNNs cannot model global features well, and these global features are crucial for classifying music signals with temporal properties. Transformers can capture long-range dependencies within an image owing to their self-attention mechanism. Nevertheless, gaps in performance and computational cost remain between Transformers and existing CNNs. In this paper, we propose a hybrid architecture (CNN-TE) based on a CNN and a Transformer encoder for MGC. Specifically, we convert the audio signals into mel spectrograms and feed them into the hybrid model for training. Our model employs a CNN to initially capture low-level, localized features from the spectrogram. These features are then processed by a Transformer encoder, which models them globally to extract high-level, abstract semantic information. The refined information is finally classified by a multi-layer perceptron. Our experiments demonstrate that this approach surpasses many existing CNN architectures on the GTZAN and FMA datasets, while using fewer parameters and achieving faster inference.
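For illustration only, the following is a minimal PyTorch-style sketch of the kind of CNN + Transformer-encoder pipeline the abstract describes. The layer sizes, number of encoder layers, n_mels = 128, and the 10-genre output (GTZAN-style) are assumptions for the sketch, not the authors' published configuration.

```python
# Sketch of a CNN -> Transformer encoder -> MLP classifier over mel spectrograms.
# Hypothetical hyperparameters; not the authors' implementation.
import torch
import torch.nn as nn

# A mel spectrogram could be produced beforehand with librosa, e.g.:
#   mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
#   mel_db = librosa.power_to_db(mel)

class CNNTE(nn.Module):
    def __init__(self, n_genres=10, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # CNN front end: captures low-level, localized time-frequency features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, kernel_size=3, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Transformer encoder: models global (long-range) dependencies
        # across the sequence of CNN features.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # MLP head for genre classification.
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                  nn.Linear(64, n_genres))

    def forward(self, mel):              # mel: (batch, 1, n_mels, time)
        f = self.cnn(mel)                # (batch, d_model, n_mels/4, time/4)
        f = f.mean(dim=2)                # pool the frequency axis -> (batch, d_model, time/4)
        f = f.transpose(1, 2)            # -> (batch, time/4, d_model) token sequence
        f = self.encoder(f)              # global modelling over time steps
        return self.head(f.mean(dim=1))  # temporal average pooling + MLP classifier

# Example: a dummy 128-mel spectrogram of roughly a 30 s clip.
logits = CNNTE()(torch.randn(2, 1, 128, 1292))
print(logits.shape)  # torch.Size([2, 10])
```

The CNN stem reduces the time-frequency map to a compact token sequence, so the encoder can attend over the whole clip before the MLP head classifies it, mirroring the local-then-global flow described in the abstract.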
Pages: 13