PAC-HUBERT: SELF-SUPERVISED MUSIC SOURCE SEPARATION VIA PRIMITIVE AUDITORY CLUSTERING AND HIDDEN-UNIT BERT

Cited by: 1
Authors
Chen, Ke [1 ,2 ]
Wichern, Gordon [1 ]
Germain, Francois G. [1 ]
Le Roux, Jonathan [1 ]
Affiliations
[1] MERL, Cambridge, MA 02139 USA
[2] UCSD, La Jolla, CA 92093 USA
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Music source separation; primitive auditory principles; self-supervised learning; BERT; audio representations; extraction; model
DOI
10.1109/ICASSPW59220.2023.10193575
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
In spite of the progress in music source separation research, the small amount of publicly available clean source data remains a constant limiting factor for performance. Thus, recent advances in self-supervised learning present a largely unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for music source separation inspired by the HuBERT speech representation model. We first investigate the potential impact of the original HuBERT model by inserting an adapted version of it into the well-known Demucs V2 time-domain separation architecture. We then propose Pac-HuBERT, a time-frequency-domain self-supervised model, which we later combine with a ResU-Net decoder for source separation. Pac-HuBERT uses primitive auditory features of music as unsupervised clustering labels to initialize the self-supervised pretraining process using the Free Music Archive (FMA) dataset. The resulting framework achieves better source-to-distortion ratio (SDR) performance on the MusDB18 test set than the original Demucs V2 and ResU-Net models. We further demonstrate that it can boost performance with small amounts of supervised data. Ultimately, our proposed framework is an effective solution to the challenge of limited clean source data for music source separation.
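The pretraining recipe summarized in the abstract — unsupervised cluster labels derived from primitive auditory features, used as HuBERT-style prediction targets — can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the specific descriptors (per-frame energy and spectral centroid), the number of clusters, and the plain k-means implementation are all assumptions standing in for whatever primitive auditory features and clustering setup the paper actually uses.

```python
# Hypothetical sketch: derive frame-level pseudo-labels from simple auditory
# descriptors of a magnitude spectrogram, as HuBERT-style pretraining targets.
# The feature choices and cluster count are illustrative assumptions.
import numpy as np

def primitive_features(spec):
    # spec: (frames, bins) magnitude spectrogram.
    # Toy "primitive auditory" descriptors: per-frame energy and spectral centroid.
    energy = spec.sum(axis=1)
    bins = np.arange(spec.shape[1])
    centroid = (spec * bins).sum(axis=1) / np.maximum(energy, 1e-8)
    return np.stack([energy, centroid], axis=1)

def kmeans_labels(feats, k=4, iters=20, seed=0):
    # Minimal k-means: returns one cluster index per frame, usable as a
    # discrete target for masked-prediction pretraining.
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = feats[labels == j].mean(axis=0)
    return labels

# Usage: pseudo-labels for 100 spectrogram frames of 64 frequency bins.
rng = np.random.default_rng(1)
spec = np.abs(rng.normal(size=(100, 64)))
labels = kmeans_labels(primitive_features(spec))
```

In the actual framework, such frame labels would supervise a masked-prediction objective (predict the cluster index of masked frames), so that pretraining on unlabelled FMA audio requires no clean source stems.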
Pages: 5