Jointly Optimizing Activation Coefficients of Convolutive NMF Using DNN for Speech Separation

Cited by: 10

Authors:

Li, Hao [1]

Nie, Shuai [2]

Zhang, Xueliang [1]

Zhang, Hui [1]

Affiliations:

[1] Inner Mongolia Univ, Coll Comp Sci, Hohhot, Peoples R China

[2] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China

Source:

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016

Keywords:

speech separation; convolutive non-negative matrix factorization (CNMF); deep neural networks (DNN)

DOI:

10.21437/Interspeech.2016-120

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Convolutive non-negative matrix factorization (CNMF) and deep neural networks (DNN) are two effective methods for monaural speech separation. A conventional DNN focuses on learning the non-linear relationship between the mixture and the target speech, but it ignores the prominent structure of the target speech. A conventional CNMF model concentrates on capturing the prominent harmonic structures and temporal continuities of speech, but it ignores the non-linear relationship between the mixture and the target. Taking both aspects into consideration at the same time may result in better performance. In this paper, we propose a joint optimization of DNN models with an extra CNMF layer for the speech separation task. We also add an extra masking layer to the proposed model to constrain the speech reconstruction. Moreover, a discriminative training criterion is proposed to further enhance separation performance. Experimental results show that the proposed model achieves significant improvements in PESQ, SAR, SIR and SDR compared with conventional methods.
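To make the CNMF and masking terms in the abstract concrete, the following is a minimal NumPy sketch of the standard CNMF reconstruction (a sum of time-lagged basis matrices applied to shifted activations) followed by a ratio-style soft mask. It is not the authors' implementation: the function names, the toy dimensions, the random bases and activations, and the exact form of the mask are all assumptions made for illustration. In the setting the title describes, a DNN would estimate the activation coefficients rather than a random generator.

```python
import numpy as np


def shift_right(H, t):
    """Shift the columns of H right by t frames, zero-filling on the left."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out


def cnmf_reconstruct(W, H):
    """Standard CNMF model: V_hat = sum over t of W[t] @ shift_right(H, t).

    W has shape (T, F, K): T time-lagged basis matrices (F bins x K bases).
    H has shape (K, N): activation coefficients over N frames.
    """
    return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))


def soft_mask(speech_hat, noise_hat, mixture, eps=1e-8):
    """Ratio mask applied to the mixture (assumed form of a masking layer)."""
    mask = speech_hat / (speech_hat + noise_hat + eps)
    return mask * mixture


# Toy sizes and random non-negative values, purely hypothetical.
rng = np.random.default_rng(0)
F, K, N, T = 6, 3, 10, 4
W_speech = rng.random((T, F, K))
W_noise = rng.random((T, F, K))
H_speech = rng.random((K, N))  # stand-in for DNN-estimated activations
H_noise = rng.random((K, N))

speech_hat = cnmf_reconstruct(W_speech, H_speech)
noise_hat = cnmf_reconstruct(W_noise, H_noise)
mixture = speech_hat + noise_hat  # idealized additive mixture magnitude
speech_est = soft_mask(speech_hat, noise_hat, mixture)
```

Because the mask divides the speech estimate by the sum of both estimates, the two masked outputs always add up to the mixture, which is one way a masking step can constrain the reconstruction.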


Pages: 550-554

Page count: 5
