Exploring Multi-Stage GAN with Self-Attention for Speech Enhancement

Cited by: 1
Authors
Asiedu Asante, Bismark Kweku [1 ]
Broni-Bediako, Clifford [2 ]
Imamura, Hiroki [1 ]
Affiliations
[1] Soka Univ, Grad Sch Sci & Engn, Hachioji 1928577, Japan
[2] RIKEN Ctr Adv Intelligence Project, Chuo City 1030027, Japan
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 16
Keywords
speech enhancement; generative adversarial network (GAN); multi-stage GAN; multi-generator GAN; self-attention mechanism;
DOI
10.3390/app13169217
Chinese Library Classification
O6 [Chemistry]
Discipline Code
0703
Abstract
Multi-stage or multi-generator generative adversarial networks (GANs) have recently been demonstrated to be effective for speech enhancement. The existing multi-generator GANs for speech enhancement use only convolutional layers for synthesising clean speech signals. This reliance on convolution operations may mask the temporal dependencies within the signal sequence. This study explores self-attention as a way to address the temporal dependency issue in multi-generator speech enhancement GANs and thereby improve their enhancement performance. We empirically study the effect of integrating a self-attention mechanism into the convolutional layers of the multiple generators in multi-stage or multi-generator speech enhancement GANs, specifically the ISEGAN and DSEGAN networks. The experimental results show that introducing a self-attention mechanism into ISEGAN and DSEGAN improves their speech enhancement quality and intelligibility across the objective evaluation metrics. Furthermore, we observe that adding self-attention to the ISEGAN's generators not only improves its enhancement performance but also narrows the performance gap between the ISEGAN and the DSEGAN while keeping a smaller model footprint. Overall, our findings highlight the potential of self-attention for improving the enhancement performance of multi-generator speech enhancement GANs.
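The integration the abstract describes, inserting a self-attention layer between the convolutional layers of each generator, is commonly realised with the SAGAN-style formulation: project the feature map to queries, keys, and values, attend over all time steps, and add the result back through a residual connection scaled by a learnable scalar initialised to zero. A minimal NumPy sketch of that layer for a 1D (time-domain) feature map is shown below; the function name, shapes, and channel reduction factor are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def self_attention_1d(x, Wq, Wk, Wv, gamma=0.0):
    """SAGAN-style self-attention over a 1D convolutional feature map.

    x:  (C, T) feature map (C channels, T time steps).
    Wq, Wk: (C // 8, C) query/key projections (reduced channels, an
            illustrative choice following the SAGAN paper).
    Wv: (C, C) value projection.
    gamma: learnable residual scale; initialised to 0 so the layer
           starts as an identity and the conv path is untouched.
    """
    q = Wq @ x                      # queries, (C', T)
    k = Wk @ x                      # keys,    (C', T)
    v = Wv @ x                      # values,  (C, T)
    scores = q.T @ k                # (T, T): each time step vs. all others
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
    out = v @ attn.T                # (C, T) attention-weighted values
    return gamma * out + x          # residual connection
```

Because the attention map is (T, T), every output time step can draw on every input time step, which is exactly the long-range temporal dependency that a stack of finite-kernel convolutions struggles to capture.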
Pages: 16