Embedding Encoder-Decoder With Attention Mechanism for Monaural Speech Enhancement

Cited by: 5
Authors
Lan, Tian [1 ,2 ]
Ye, Wenzheng [1 ]
Lyu, Yilan [1 ]
Zhang, Junyi [2 ]
Liu, Qiao [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Software Engn, Chengdu 610051, Peoples R China
[2] Electromagnet Spectrum Cognit & Management Key La, Shijiazhuang 050081, Hebei, Peoples R China
Keywords
Speech enhancement; Decoding; Noise measurement; Spectrogram; Signal to noise ratio; Feature extraction; Convolution; embedding encoder-decoder; convolutional neural network; attention mechanism; neural network; NOISE; SUPPRESSION;
DOI
10.1109/ACCESS.2020.2995346
Chinese Library Classification (CLC)
TP [Automation & Computer Technology];
Discipline code
0812;
Abstract
The auditory selection framework with attention and memory (ASAM), which comprises an attention mechanism, an embedding generator, a generated embedding array, and a life-long memory, is used to process mixed speech. When ASAM is applied to speech enhancement, the discrepancy between the voice and noise feature memories is large, which increases the separability of noise and voice. However, ASAM cannot achieve desirable speech enhancement performance because it fails to exploit the time-frequency dependence of the embedding vectors when generating each mask unit. This work proposes a novel embedding encoder-decoder (EED) that uses a convolutional neural network (CNN) as the decoder. The CNN structure is good at detecting local patterns, which can be exploited to extract correlated embeddings from the embedding array and generate the target spectrogram. This work evaluates a comparable ASAM, an EED with an LSTM encoder and a CNN decoder (RC-EED), RC-EED with an attention mechanism (RC-AEED), other similar EED structures, and baseline models. Experimental results show that the RC-EED and RC-AEED networks perform well on the speech enhancement task under low signal-to-noise-ratio conditions. In addition, RC-AEED outperforms ASAM in speech enhancement and achieves better speech quality than the deep recurrent network and the convolutional recurrent network.
Pages: 96677-96685
Number of pages: 9