Reverberant speech separation with probabilistic time-frequency masking for B-format recordings

Cited by: 25
Authors
Chen, Xiaoyi [1 ]
Wang, Wenwu [2 ]
Wang, Yingmin [1 ]
Zhong, Xionghu [3 ]
Alinaghi, Atiyeh [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Marine Sci & Technol, Dept Acoust Engn, Xian 710072, Peoples R China
[2] Univ Surrey, Dept Elect Engn, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, Surrey, England
[3] Nanyang Technol Univ, Coll Engn, Sch Comp Engn, Singapore 639798, Singapore
Keywords
B-format signal; Acoustic intensity; Expectation-maximization (EM) algorithm; Blind source separation (BSS); Direction of arrival (DOA); BLIND SOURCE SEPARATION; INDEPENDENT COMPONENT ANALYSIS; DIRECTION-OF-ARRIVAL ESTIMATION; CONVOLUTIVE MIXTURES; ALGORITHMS; ROBUST; ICA
DOI
10.1016/j.specom.2015.01.002
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Existing speech source separation approaches rely overwhelmingly on acoustic pressure information acquired with a microphone array. Little attention has been devoted to the use of B-format microphones, which capture both the acoustic pressure and the pressure gradient, so that direction of arrival (DOA) cues can be estimated from the received signals. In this paper, such DOA cues, together with frequency bin-wise mixing vector (MV) cues, are used to evaluate the contribution of a specific source at each time-frequency (T-F) point of the mixtures in order to separate that source from the mixture. A source separation algorithm is developed in which the DOA and MV cues are modelled by a von Mises mixture model and a complex Gaussian mixture model respectively, and the model parameters are estimated via an expectation-maximization (EM) algorithm. A T-F mask is then derived from the model parameters for recovering the sources. Moreover, we further improve the separation performance by retaining only the reliable DOA estimates at the T-F units based on thresholding. The performance of the proposed method is evaluated in both simulated room environments and a real reverberant studio in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ). The experimental results show its advantage over four baseline algorithms, including three T-F mask based approaches and one convolutive independent component analysis (ICA) based method. (C) 2015 Elsevier B.V. All rights reserved.
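As a rough illustration of the DOA branch of such an EM-based probabilistic masking scheme, the sketch below fits a von Mises mixture to per-T-F azimuth estimates and uses the resulting posteriors as soft T-F masks. This is a minimal sketch and not the authors' implementation: the intensity-based azimuth estimator, the Best-Fisher approximation for the concentration update, and all function and variable names (doa_from_bformat, vm_mixture_em, etc.) are assumptions made only for illustration.

```python
# Minimal, illustrative sketch (not the paper's implementation) of the DOA
# branch of an EM-fitted probabilistic T-F mask for B-format input.
# Assumptions: W (pressure) and X, Y (pressure gradients) are complex STFT
# arrays of equal shape; the intensity-based azimuth estimator and the
# Best-Fisher kappa approximation are common choices, not necessarily the
# paper's; all names here are hypothetical.
import numpy as np
from scipy.special import i0  # modified Bessel function of order 0

def doa_from_bformat(W, X, Y):
    """Per-T-F azimuth estimate from the active sound intensity."""
    return np.arctan2(np.real(np.conj(W) * Y), np.real(np.conj(W) * X))

def vm_pdf(theta, mu, kappa):
    """Von Mises density on the circle."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def inv_a(r):
    """Approximate inverse of A(kappa) = I1/I0 (Best & Fisher)."""
    if r < 0.53:
        return 2 * r + r**3 + 5 * r**5 / 6
    if r < 0.85:
        return -0.4 + 1.39 * r + 0.43 / (1 - r)
    return 1.0 / (r**3 - 4 * r**2 + 3 * r)

def vm_mixture_em(doa, n_src, n_iter=50, seed=0):
    """Fit an n_src-component von Mises mixture to flattened T-F DOA
    estimates (radians) by EM; the posteriors serve as soft T-F masks."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-np.pi, np.pi, n_src)   # component mean directions
    kappa = np.full(n_src, 5.0)              # concentrations
    w = np.full(n_src, 1.0 / n_src)          # mixing weights
    for _ in range(n_iter):
        # E-step: posterior probability of each source at each T-F point
        lik = np.stack([w[k] * vm_pdf(doa, mu[k], kappa[k])
                        for k in range(n_src)], axis=1)          # (N, K)
        gamma = lik / np.maximum(lik.sum(axis=1, keepdims=True), 1e-12)
        # M-step: update weights, circular means, and concentrations
        nk = gamma.sum(axis=0)
        w = nk / len(doa)
        z = gamma.T @ np.exp(1j * doa)        # weighted resultant vectors
        mu = np.angle(z)
        r_bar = np.abs(z) / np.maximum(nk, 1e-12)
        kappa = np.array([inv_a(min(r, 0.99)) for r in r_bar])
    return gamma, mu  # gamma: soft masks (one column per source); mu: DOAs

# Hypothetical usage: gamma, _ = vm_mixture_em(doa_from_bformat(W, X, Y).ravel(), 2)
# Each column of gamma, reshaped to the spectrogram size, weights the mixture
# STFT before inverse transformation.
```

In the paper these DOA posteriors are further combined with posteriors from a complex Gaussian mixture model on the frequency bin-wise mixing vectors, and unreliable DOA estimates are discarded by thresholding before the mask is formed; those steps are omitted from the sketch above.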
Pages: 41 - 54
Page count: 14
Related Papers
50 records in total
  • [1] ACOUSTIC VECTOR SENSOR BASED REVERBERANT SPEECH SEPARATION WITH PROBABILISTIC TIME-FREQUENCY MASKING
    Zhong, Xionghu
    Chen, Xiaoyi
    Wang, Wenwu
    Alinaghi, Atiyeh
    Premkumar, A. B.
    2013 PROCEEDINGS OF THE 21ST EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2013
  • [2] SPATIAL AND COHERENCE CUES BASED TIME-FREQUENCY MASKING FOR BINAURAL REVERBERANT SPEECH SEPARATION
    Alinaghi, Atiyeh
    Wang, Wenwu
    Jackson, Philip J. B.
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013: 684 - 688
  • [3] Robust speech separation using time-frequency masking
    Aarabi, P
    Shi, GJ
    Jahromi, O
    2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I, PROCEEDINGS, 2003: 741 - 744
  • [4] The Effect of Partial Time-Frequency Masking of the Direct Sound on the Perception of Reverberant Speech
    Madmoni, Lior
    Tibor, Shir
    Nelken, Israel
    Rafaely, Boaz
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 2037 - 2047
  • [5] Blind separation of speech mixtures via time-frequency masking
    Yilmaz, Ö
    Rickard, S
    IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2004, 52 (07): 1830 - 1847
  • [6] On time-frequency masking in voiced speech
    Skoglund, J
    Kleijn, WB
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2000, 8 (04): 361 - 369
  • [7] On the integration of time-frequency masking speech separation and recognition in underdetermined environments
    Jafari, Ingrid
    Haque, Serajul
    Togneri, Roberto
    Nordholm, Sven
    2012 CONFERENCE RECORD OF THE FORTY SIXTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS (ASILOMAR), 2012: 1613 - 1617
  • [8] Blind speech source separation via nonlinear time-frequency masking
    Xu, Shun
    Chen, Shaorong
    Liu, Yulin
    Shengxue Xuebao/Acta Acustica, 2007, 32 (04): 375 - 381
  • [9] Blind speech source separation via nonlinear time-frequency masking
    Xu, Shun
    Chen, Shaorong
    Liu, Yulin
    Chinese Journal of Acoustics, 2008, (03): 203 - 214
  • [10] Loudspeaker localization using B-format recordings
    Gunel, B
    2003 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS PROCEEDINGS, 2003: 59 - 62