On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset

Cited by: 153
Authors
Hsu, Chao-Ling [1 ]
Jang, Jyh-Shing Roger [1 ]
Affiliations
[1] Natl Tsing Hua Univ, Dept Comp Sci, MediaTek NTHU Joint Lab, Hsinchu 30013, Taiwan
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2010, Vol. 18, No. 2
Keywords
Computational auditory scene analysis (CASA); singing voice separation; unvoiced sound separation; SPEECH; SEGREGATION; MODELS;
DOI
10.1109/TASL.2009.2026503
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Monaural singing voice separation is an extremely challenging problem. While pitch-based inference methods have led to considerable progress in separating the voiced singing voice, little attention has been paid to their inability to separate the unvoiced singing voice, which has an inharmonic structure and weaker energy. In this paper, we propose a systematic approach to identify and separate the unvoiced singing voice from the music accompaniment. We also enhance the separation of the voiced singing voice via a spectral subtraction method. The proposed system follows the framework of computational auditory scene analysis (CASA), which consists of a segmentation stage and a grouping stage. In the segmentation stage, the input song signals are decomposed into small sensory elements at different time-frequency resolutions. The unvoiced sensory elements are then identified by Gaussian mixture models. Experimental results demonstrate that the quality of the separated singing voice is improved for both the unvoiced and voiced parts. Moreover, to address the lack of a publicly available dataset for singing voice separation, we have constructed a corpus called MIR-1K (Multimedia Information Retrieval lab, 1000 song clips), in which all singing voices and music accompaniments were recorded separately. Each song clip comes with human-labeled pitch values, unvoiced sounds, vocal/non-vocal segments, and lyrics, as well as a speech recording of the lyrics.
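For readers who want a concrete picture of the two components named in the abstract, the sketch below is a minimal illustration, not the authors' implementation: it fits class-conditional Gaussian mixture models over frame-level features to flag unvoiced frames, and applies a simple spectral-subtraction step given an accompaniment magnitude estimate. The feature choice, frame parameters, and function names are all illustrative assumptions.

# Minimal sketch (not the paper's implementation): GMM-based unvoiced-frame
# detection and simple spectral subtraction. Features and parameters are
# illustrative assumptions.
import numpy as np
from scipy.signal import stft, istft
from sklearn.mixture import GaussianMixture

def train_unvoiced_gmms(unvoiced_feats, other_feats, n_components=8):
    # One GMM per class, fit on frame-level features (e.g., MFCCs).
    gmm_unv = GaussianMixture(n_components=n_components).fit(unvoiced_feats)
    gmm_oth = GaussianMixture(n_components=n_components).fit(other_feats)
    return gmm_unv, gmm_oth

def unvoiced_mask(feats, gmm_unv, gmm_oth):
    # A frame is labeled unvoiced when its unvoiced-GMM log-likelihood wins.
    return gmm_unv.score_samples(feats) > gmm_oth.score_samples(feats)

def spectral_subtract(mixture, accomp_estimate, fs=16000, floor=0.05):
    # Subtract an accompaniment magnitude estimate from the mixture
    # spectrogram, then resynthesize using the mixture phase.
    _, _, X = stft(mixture, fs=fs, nperseg=1024)
    _, _, A = stft(accomp_estimate, fs=fs, nperseg=1024)
    mag = np.maximum(np.abs(X) - np.abs(A), floor * np.abs(X))
    _, voice = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=1024)
    return voice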
Pages: 310-319
Number of pages: 10