Motivic Pattern Classification of Music Audio Signals Combining Residual and LSTM Networks

Cited by: 18

Authors:

Arronte Alvarez, Aitor [1,2]

Gomez, Francisco [1]

Affiliations:

[1] Univ Politecn Madrid, Madrid, Spain

[2] Univ Hawaii Manoa, Ctr Language & Technol, Honolulu, HI 96822 USA

Source:

INTERNATIONAL JOURNAL OF INTERACTIVE MULTIMEDIA AND ARTIFICIAL INTELLIGENCE | 2021 / Vol. 6 / No. 6

Keywords:

Motivic Patterns; Convolutional Neural Networks; Data Augmentation; Audio Signal Processing; Music Information Retrieval;

DOI:

10.9781/ijimai.2021.01.003

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Motivic pattern classification from music audio recordings is a challenging task, particularly for a cappella flamenco cantes, which are characterized by complex melodic variations, pitch instability, timbre changes, extreme vibrato oscillations, microtonal ornamentations, and noisy recording conditions. Convolutional Neural Networks (CNNs) have proven to be very effective algorithms in image classification. Recent work in large-scale audio classification has shown that CNN architectures originally developed for image problems can be applied successfully to audio event recognition and classification with little or no modification to the networks. In this paper, CNN architectures are tested on a more nuanced problem: intra-style classification of flamenco cantes using small motivic patterns. A new architecture is proposed that uses residual CNNs as feature extractors and a bidirectional LSTM layer to exploit the sequential nature of musical audio data. We present a full end-to-end pipeline for music audio classification that includes a sequential pattern mining technique and a contour simplification method to extract relevant motifs from audio recordings. Mel-spectrograms of the extracted motifs are then used as the input to the different architectures tested. We investigate the usefulness of motivic patterns for the automatic classification of music recordings, as well as the effect of audio length and corpus size on overall classification accuracy. Results show a relative accuracy improvement of up to 20.4% when CNN architectures are trained using acoustic representations from motivic patterns.

Citation:

Pages: 208-214

Page count: 7
