ATT: Adversarial Trained Transformer for Speech Enhancement

Cited by: 0
Authors
Aitawade, Aniket [1 ]
Bharati, Puja [1 ]
Chandra, Sabyasachi [1 ]
Prasad, G. Satya [1 ]
Pramanik, Debolina [1 ]
Khadse, Parth Sanjay [1 ]
Das Mandal, Shyamal Kumar [1 ]
Affiliations
[1] Indian Institute of Technology Kharagpur, Speech Processing Laboratory, Kharagpur, West Bengal, India
Source
SPEECH AND COMPUTER, SPECOM 2023, PT I | 2023 / Vol. 14338
Keywords
Adversarial trained transformer; Generative adversarial network; Speech enhancement; Transformer;
DOI
10.1007/978-3-031-48309-7_22
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Speech enhancement is crucial in applications where background noise or interference degrades the quality of speech signals. Traditional signal processing techniques struggle with complex, non-stationary noise sources, leading to sub-optimal performance in real-world scenarios. In recent years, research on machine learning and deep learning algorithms for speech enhancement has grown rapidly. This paper presents a novel method called Adversarial Trained Transformer (ATT) for speech enhancement. The generator component of ATT is based on a transformer architecture that uses multi-head attention and LSTM-embedder layers to capture temporal dependencies and local structure in speech signals, while the discriminator employs convolutional layers for binary classification. The effectiveness of ATT is demonstrated through experiments on the VoiceBank+DEMAND dataset. The results show notable improvements across several speech quality metrics compared with baseline methods. Moreover, the ATT model achieves this performance with considerably fewer parameters than competing models: the ATT generator has 6.57 million parameters, whereas the SEGAN generator has 74.13 million. These results indicate that the proposed ATT architecture holds considerable promise for improving speech quality and offers a viable strategy that combines the strengths of transformer-based models with adversarial training.
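The abstract only outlines the architecture, so the following is a minimal PyTorch sketch of the described generator/discriminator pairing plus one adversarial training step. All layer sizes, the frame-based input representation, and the use of a standard GAN (BCE) loss are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ATTGenerator(nn.Module):
    """Transformer-style generator: an LSTM embedder for local structure
    followed by multi-head self-attention for temporal dependencies.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, frame_dim=256, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # LSTM embedder: maps noisy-speech frames into the model dimension
        # while injecting local temporal context.
        self.embedder = nn.LSTM(frame_dim, d_model, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, frame_dim)  # project back to frame space

    def forward(self, noisy):              # noisy: (batch, frames, frame_dim)
        x, _ = self.embedder(noisy)
        x = self.encoder(x)
        return self.out(x)                 # enhanced frames, same shape

class ATTDiscriminator(nn.Module):
    """Convolutional discriminator for binary clean-vs-enhanced classification."""
    def __init__(self, frame_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(frame_dim, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 64, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, 1),              # single logit: real (clean) vs. fake
        )

    def forward(self, frames):             # frames: (batch, frames, frame_dim)
        return self.net(frames.transpose(1, 2))  # Conv1d expects channels first

# One illustrative adversarial training step on dummy data.
G, D = ATTGenerator(), ATTDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

noisy = torch.randn(2, 100, 256)           # dummy batch: 100 frames of dim 256
clean = torch.randn(2, 100, 256)

# Discriminator step: clean speech is "real", generator output is "fake".
enhanced = G(noisy).detach()
loss_d = bce(D(clean), torch.ones(2, 1)) + bce(D(enhanced), torch.zeros(2, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to fool the discriminator.
loss_g = bce(D(G(noisy)), torch.ones(2, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

GAN-based enhancers such as SEGAN typically add a reconstruction term (e.g., L1 distance between enhanced and clean speech) to the generator loss; the sketch keeps only the adversarial term for brevity.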
Pages: 258-270
Page count: 13
References
18 entries in total
  • [1] ITU-T Rec. P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs (2005)
  • [2] Hu, Y., Loizou, P.C.: Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 16(1), 229-238 (2008)
  • [3] Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 5967-5976 (2017)
  • [4] Loizou, P.C.: Speech Enhancement: Theory and Practice. CRC Press (2013). DOI 10.1201/b14529
  • [5] Pandey, A.: In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), p. 6875 (2019). DOI 10.1109/ICASSP.2019.8683634
  • [6] Pandey, A.: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 1179 (2019). DOI 10.1109/TASLP.2019.2913512
  • [7] Park, S.R.: arXiv preprint arXiv:1609.07132 (2016). DOI 10.48550/arXiv.1609.07132
  • [8] Pascual, S.: arXiv preprint arXiv:1703.09452 (2017)
  • [9] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536-2544 (2016)
  • [10] Shahriyar, S.A.: In: 2019 International Conference on Electrical, Computer and Communication Engineering, p. 1 (2019). DOI 10.1109/ECACE.2019.8679106