Convolutional gated recurrent unit networks based real-time monaural speech enhancement

Cited by: 1
Authors
Vanambathina, Sunny Dayal [1 ]
Anumola, Vaishnavi [1 ]
Tejasree, Ponnapalli [1 ]
Divya, R. [1 ]
Manaswini, B. [2 ]
Affiliations
[1] Vellore Inst Technol, Dept Elect & Commun Engn, Andhra Pradesh VIT AP, Amaravathi 522237, India
[2] Lakireddy Balireddy Coll Engn, Comp Sci & Engn Dept, Mylavaram, India
Keywords
Speech enhancement; Deep learning; Discrete cosine transform; Signal-to-noise ratio; Masking; Noise
DOI
10.1007/s11042-023-15639-9
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Deep-learning-based speech enhancement has many applications, such as improving speech intelligibility and perceptual quality. Many existing methods focus on enhancing the amplitude spectrum. In these models, the complex-valued layers are computationally expensive, which poses a serious challenge for deployment on devices. Since DFT data are complex-valued, computation is difficult because the real and imaginary parts of the signal must be handled at the same time. To reduce computation, some researchers use variants of the STFT as input, such as the amplitude/energy spectrum or the log-Mel spectrum; however, these all enhance the amplitude spectrum without estimating the clean phase, which limits enhancement performance. The proposed method instead uses the DCT, a real-valued transform that loses no information and carries phase implicitly. This avoids having to manually design a complex-valued network to estimate the phase explicitly, and it improves enhancement performance. Much research has addressed phase-spectrum estimation, both directly and indirectly, but the results are not ideal. Recently, complex-valued models such as the deep complex convolution recurrent network (DCCRN) have been proposed, but their computational cost is very high. Therefore, a deep cosine-transform convolutional gated recurrent unit (DCTCGRU) network is proposed to reduce the complexity and further improve performance. The GRU effectively models the correlation between adjacent frames of noisy speech. Experimental results show that DCTCGRU achieves better results than state-of-the-art algorithms in terms of SNR, PESQ, and STOI.
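This record does not include the network details; as a rough illustration of the pipeline the abstract describes (real-valued DCT frames of noisy speech fed to a convolutional GRU that estimates a mask, followed by an inverse DCT and overlap-add), a minimal PyTorch sketch is shown below. The frame length, layer sizes, sigmoid mask, and helper functions are illustrative assumptions, not the authors' published DCTCGRU configuration.

import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct, idct

FRAME, HOP = 512, 256  # assumed frame length and hop size

def frame_signal(x, frame=FRAME, hop=HOP):
    # Split a 1-D signal into overlapping Hann-windowed frames.
    n = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hanning(frame)[None, :]

def overlap_add(frames, hop=HOP):
    # Reconstruct a signal from overlapping frames (window normalization omitted).
    n, frame = frames.shape
    out = np.zeros((n - 1) * hop + frame)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame] += f
    return out

class DCTConvGRU(nn.Module):
    # Hypothetical convolutional GRU masker operating on real-valued DCT coefficients.
    def __init__(self, n_coef=FRAME, channels=16, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU())
        self.gru = nn.GRU(channels * n_coef, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_coef)

    def forward(self, x):                       # x: (batch, frames, n_coef)
        b, t, c = x.shape
        z = self.conv(x.reshape(b * t, 1, c)).reshape(b, t, -1)
        h, _ = self.gru(z)                      # GRU models correlation across adjacent frames
        mask = torch.sigmoid(self.fc(h))
        return mask * x                         # masked DCT coefficients

# Usage on a placeholder noisy waveform (random samples stand in for real audio):
noisy = np.random.randn(16000)
coef = dct(frame_signal(noisy), type=2, norm='ortho', axis=-1)   # real-valued features, implicit phase
model = DCTConvGRU()
with torch.no_grad():
    enhanced_coef = model(torch.from_numpy(coef)[None].float())[0].numpy()
enhanced = overlap_add(idct(enhanced_coef, type=2, norm='ortho', axis=-1))

Because the DCT coefficients are real, no separate phase branch or complex-valued layer is needed; the same mask-and-invert structure is what lets a model of this kind stay lightweight compared with complex-spectrum approaches such as DCCRN.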
Pages: 45717-45732
Number of pages: 16
Related papers
53 items in total
[1] Ahmed N, Natarajan T, Rao KR. Discrete cosine transform. IEEE Transactions on Computers, 1974, C-23(1): 90-93.
[2] Allen JB, Berkley DA. Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 1979, 65(4): 943-950.
[3] Andreas J. Proceedings of the 18th International Society for Music Information Retrieval Conference, 2017: 23.
[4] Chen D. IEEE Transactions on Neural Networks and Learning Systems, 2021: 12.
[5] Choi H-S. International Conference on Learning Representations, 2018.
[6] Delfarah M, Wang DL. Features for masking-based monaural speech separation in reverberant conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(5): 1085-1094.
[7] Erdogan H. IEEE International Conference on Acoustics, Speech and Signal Processing, 2015: 708. DOI: 10.1109/ICASSP.2015.7178061.
[8] Garofolo JS. DARPA TIMIT Acoustic, 1993, Vol. 93.
[9] Geng C. 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), 2020: 379. DOI: 10.1109/ICAICA50127.2020.9182513.
[10] Hao X. IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 312.