Solos: A Dataset for Audio-Visual Music Analysis

Times Cited: 2
Authors
Montesinos, Juan F. [1 ]
Slizovskaia, Olga [1 ]
Haro, Gloria [1 ]
Affiliations
[1] Univ Pompeu Fabra, Dept Informat & Commun Technol, Barcelona, Spain
Source
2020 IEEE 22ND INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP) | 2020
Funding
European Union Horizon 2020;
Keywords
audio-visual; dataset; multimodal; music; SOURCE SEPARATION; AUDIO;
DOI
10.1109/mmsp48831.2020.9287124
Chinese Library Classification (CLC) Number
TP31 [Computer Software];
Discipline Classification Code
081202 ; 0835 ;
Abstract
In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner, since a large proportion of its recordings are auditions and manually checked recordings, ensuring that there is no background noise and no effects added in video post-processing. Moreover, to the best of our knowledge, it is the only dataset that contains the whole set of instruments present in the URMP [1] dataset, a high-quality dataset of 44 audio-visual recordings of multi-instrument classical music pieces with individual audio tracks. Since URMP was intended to be used for source separation, we evaluate on the URMP dataset the performance of two different source-separation models trained on Solos. The dataset is publicly available at https://juanfmontesinos.github.io/Solos/.
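Because the clips are gathered from YouTube, a typical first step for anyone working with the dataset is to fetch the listed videos locally. The following is a minimal, hypothetical sketch in Python using the yt-dlp library; it assumes a made-up JSON file solos_ids.json that maps each of the 13 instrument names to a list of YouTube video IDs, and it is not the official Solos download tooling.

# Hypothetical sketch: download solo-performance videos per instrument,
# assuming a JSON file mapping instrument names to YouTube video IDs.
import json
from pathlib import Path

import yt_dlp  # pip install yt-dlp


def download_solos(id_file: str = "solos_ids.json", out_dir: str = "solos") -> None:
    """Download every listed YouTube video into one folder per instrument."""
    with open(id_file) as f:
        ids_per_instrument = json.load(f)  # e.g. {"Violin": ["abc123", ...], ...}

    for instrument, video_ids in ids_per_instrument.items():
        target = Path(out_dir) / instrument
        target.mkdir(parents=True, exist_ok=True)
        opts = {
            "format": "mp4",
            "outtmpl": str(target / "%(id)s.%(ext)s"),
            "ignoreerrors": True,  # some videos may have been removed
        }
        urls = [f"https://www.youtube.com/watch?v={vid}" for vid in video_ids]
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.download(urls)


if __name__ == "__main__":
    download_solos()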
Pages: 6
References (43 in total)
[1] [Anonymous], Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[2] [Anonymous], 2017, Proceedings of the Sound and Music Computing Conference.
[3] [Anonymous], 2015, ACS Symposium Series.
[4] Arandjelovic R., 2018, Lecture Notes in Computer Science, DOI 10.1007/978-3-030-01246-5_27.
[5] Cao Z., Hidalgo G., Simon T., Wei S.-E., Sheikh Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(1):172-186.
[6] Chandna P., Miron M., Janer J., Gomez E. Monoaural Audio Source Separation Using Deep Convolutional Neural Networks. Latent Variable Analysis and Signal Separation (LVA/ICA 2017), 2017, 10169:258-266.
[7] Chen L., Srivastava S., Duan Z., Xu C. Deep Cross-Modal Audio-Visual Generation. Proceedings of the Thematic Workshops of ACM Multimedia 2017, 2017:349-357.
[9] Darrell T., 2000, Lecture Notes in Computer Science, Vol. 1948, p. 32.
[10] Dixon S., 2018, Proceedings of ISMIR 2018.