ATT: Adversarial Trained Transformer for Speech Enhancement

Cited by: 0
Authors
Aitawade, Aniket [1 ]
Bharati, Puja [1 ]
Chandra, Sabyasachi [1 ]
Prasad, G. Satya [1 ]
Pramanik, Debolina [1 ]
Khadse, Parth Sanjay [1 ]
Das Mandal, Shyamal Kumar [1 ]
Affiliations
[1] Indian Institute of Technology Kharagpur, Speech Processing Laboratory, Kharagpur, West Bengal, India
Source
SPEECH AND COMPUTER, SPECOM 2023, PT I | 2023 / Vol. 14338
Keywords
Adversarial trained transformer; Generative adversarial network; Speech enhancement; Transformer;
DOI
10.1007/978-3-031-48309-7_22
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Speech enhancement is crucial in applications where background noise or interference degrades the quality of speech signals. Traditional signal processing techniques struggle with complex, non-stationary noise sources, leading to sub-optimal performance in real-world scenarios. In recent years, research on machine learning and deep learning algorithms for speech enhancement has grown rapidly. This paper presents a novel method called Adversarial Trained Transformer (ATT) for speech enhancement. The generator component of ATT is based on a transformer architecture that uses multi-head attention and LSTM-embedder layers to capture temporal dependencies and local structure in speech signals, while the discriminator employs convolutional layers for binary classification. The effectiveness of ATT is demonstrated through experiments on the VoiceBank+DEMAND dataset. The results show notable improvements across several speech quality metrics compared with baseline methods. Moreover, the ATT model achieves this performance with considerably fewer parameters than competing models: the ATT generator has 6.57 million parameters, whereas the SEGAN generator has 74.13 million. These results indicate that the proposed ATT architecture holds considerable promise for improving speech quality and offers a viable strategy that combines the strengths of transformer-based models with adversarial training.
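The abstract only outlines the architecture, so the following is a minimal PyTorch sketch of the described generator/discriminator pairing plus one adversarial training step. All layer sizes, the frame-based input representation, and the use of a standard GAN (BCE) loss are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ATTGenerator(nn.Module):
    """Transformer-style generator: an LSTM embedder for local structure
    followed by multi-head self-attention for temporal dependencies.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, frame_dim=256, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # LSTM embedder: maps noisy-speech frames into the model dimension
        # while injecting local temporal context.
        self.embedder = nn.LSTM(frame_dim, d_model, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, frame_dim)  # project back to frame space

    def forward(self, noisy):              # noisy: (batch, frames, frame_dim)
        x, _ = self.embedder(noisy)
        x = self.encoder(x)
        return self.out(x)                 # enhanced frames, same shape

class ATTDiscriminator(nn.Module):
    """Convolutional discriminator for binary clean-vs-enhanced classification."""
    def __init__(self, frame_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(frame_dim, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 64, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, 1),              # single logit: real (clean) vs. fake
        )

    def forward(self, frames):             # frames: (batch, frames, frame_dim)
        return self.net(frames.transpose(1, 2))  # Conv1d expects channels first

# One illustrative adversarial training step on dummy data.
G, D = ATTGenerator(), ATTDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

noisy = torch.randn(2, 100, 256)           # dummy batch: 100 frames of dim 256
clean = torch.randn(2, 100, 256)

# Discriminator step: clean speech is "real", generator output is "fake".
enhanced = G(noisy).detach()
loss_d = bce(D(clean), torch.ones(2, 1)) + bce(D(enhanced), torch.zeros(2, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to fool the discriminator.
loss_g = bce(D(G(noisy)), torch.ones(2, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

GAN-based enhancers such as SEGAN typically add a reconstruction term (e.g., L1 distance between enhanced and clean speech) to the generator loss; the sketch keeps only the adversarial term for brevity.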
Pages: 258-270
Page count: 13
References
18 entries in total
  • [1] ITU-T Rec. P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs (2005)
  • [2] Hu, Y., Loizou, P.C.: Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 16(1), 229-238 (2008)
  • [3] Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 5967-5976 (2017)
  • [4] Loizou, P.C.: Speech Enhancement: Theory and Practice. CRC Press (2013). DOI 10.1201/b14529
  • [5] Pandey, A.: In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), p. 6875 (2019). DOI 10.1109/ICASSP.2019.8683634
  • [6] Pandey, A.: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 1179 (2019). DOI 10.1109/TASLP.2019.2913512
  • [7] Park, S.R.: arXiv preprint arXiv:1609.07132 (2016). DOI 10.48550/arXiv.1609.07132
  • [8] Pascual, S.: arXiv preprint arXiv:1703.09452 (2017)
  • [9] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536-2544 (2016)
  • [10] Shahriyar, S.A.: In: 2019 International Conference on Electrical, Computer and Communication Engineering, p. 1 (2019). DOI 10.1109/ECACE.2019.8679106