Self-Supervised Learning of Multi-Level Audio Representations for Music Segmentation

Cited by: 0
Authors
Buisson, Morgan [1 ]
McFee, Brian [2 ]
Essid, Slim [1 ]
Crayencour, Helene C. [3 ]
Affiliations
[1] Télécom Paris, F-91120 Palaiseau, France
[2] New York University, New York, NY 10012 USA
[3] Université Paris-Sud, CentraleSupélec, CNRS, F-91190 Gif-sur-Yvette, France
Keywords
Music; Annotations; Task analysis; Training; Feature extraction; Self-supervised learning; Artificial neural networks; Music structure analysis; Structural segmentation; Representation learning
DOI
10.1109/TASLP.2024.3379894
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
The task of music structure analysis consists in automatically identifying the location and nature of musical sections within a song. In the supervised scenario, structural annotations generally result from exhaustive data collection processes, which is one of the main challenges of this task. Moreover, both the subjectivity of music structure and its hierarchical characteristics make the obtained annotations not fully reliable, in the sense that they do not convey a "universal ground truth", unlike in other music information retrieval tasks. On the other hand, the quickly growing quantity of available music data has enabled weakly supervised and self-supervised approaches to achieve impressive results on a wide range of music-related problems. In this work, a self-supervised learning method is proposed to learn robust multi-level music representations prior to structural segmentation using contrastive learning. To this end, sets of frames sampled at different levels of detail are used to train a deep neural network in a disentangled manner. The proposed method is evaluated on both flat and multi-level segmentation. We show that each distinct sub-region of the output embeddings can efficiently account for structural similarity at its own targeted level of detail, which ultimately improves the performance of downstream flat and multi-level segmentation. Finally, complementary experiments are carried out to study how the obtained representations can be further adapted to specific datasets using a supervised fine-tuning objective, in order to facilitate structure retrieval in domains where human annotations remain scarce.
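The record does not include any implementation details beyond the abstract. As a rough illustration of the kind of multi-level, disentangled contrastive objective described there, the sketch below splits each embedding into disjoint sub-regions, one per targeted level of structural detail, and applies an InfoNCE-style loss to each sub-region independently. All names, dimensions, and the temperature value are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): per-level contrastive losses
# applied to disjoint sub-regions of frame embeddings.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss between two batches of paired (positive) embeddings.

    Row i of z_a is a positive of row i of z_b; all other rows act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def multi_level_loss(emb_a, emb_b, level_dims=(32, 32, 64)):
    """Sum per-level contrastive losses over disjoint embedding sub-regions.

    emb_a, emb_b: (B, D) embeddings of two frames assumed to belong to the same
    section at every level; D must equal sum(level_dims). Each slice of size
    level_dims[k] is trained to capture similarity at level k only.
    """
    losses, start = [], 0
    for d in level_dims:
        losses.append(info_nce(emb_a[:, start:start + d], emb_b[:, start:start + d]))
        start += d
    return sum(losses) / len(level_dims)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for a network's frame embeddings.
    B, D = 16, 128
    emb_a, emb_b = torch.randn(B, D), torch.randn(B, D)
    print(multi_level_loss(emb_a, emb_b).item())
```

In practice the positive pairs for each level would come from the paper's sampling scheme (frames drawn at different levels of detail), which is not specified in this record.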
Pages: 2141 - 2152
Page count: 12