Full single-type deep learning models with multihead attention for speech enhancement

Cited by: 2
Authors
Zacarias-Morales, Noel [1 ]
Hernandez-Nolasco, Jose Adan [1 ]
Pancardo, Pablo [1 ]
Affiliations
[1] Juarez Autonomous Univ Tabasco, Acad Div Sci & Informat Technol, Cunduacan 86690, Tabasco, Mexico
Keywords
Artificial neural network; Attention; Deep learning models; Speech enhancement; SELF-ATTENTION; ALGORITHM; NOISE;
DOI
10.1007/s10489-023-04571-y
CLC Classification Code
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Artificial neural network (ANN) models with attention mechanisms for removing noise from audio signals, known as speech enhancement models, have proven effective. However, their architectures become complex, deep, and computationally demanding as they pursue higher levels of efficiency. Given this situation, we selected and evaluated simple, less resource-demanding models, using the same training parameters and performance metrics to conduct a fair comparison among the four selected models. Our purpose was to demonstrate that simple neural network models with multihead attention are efficient when implemented on devices with conventional computational resources, since they provide results competitive with those of hybrid, complex, and resource-demanding models. We experimentally evaluated the efficiency of multilayer perceptron (MLP), one-dimensional and two-dimensional convolutional neural network (CNN), and gated recurrent unit (GRU) deep learning models with and without multihead attention. We also analyzed the generalization capability of each model. The results showed that although these architectures comprised only one type of ANN, multihead attention increased the efficiency of the speech enhancement process, yielding results competitive with those of complex models. This study can therefore serve as a reference for building simple and efficient single-type ANN models with attention.
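The abstract describes single-type models (MLP, CNN, GRU) augmented with multihead attention. As a rough illustration of the attention component only, the following is a minimal NumPy sketch of one multihead self-attention pass over a sequence of spectrogram-like frames. All projection weights are random stand-ins for learned parameters, and the sizes (10 frames, 32 features, 4 heads) are arbitrary assumptions, not values taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(x, num_heads, rng):
    """One multihead self-attention pass over x of shape (frames, features).

    The projection matrices below are random placeholders; in a trained
    speech enhancement model they would be learned parameters.
    """
    t, d = x.shape
    assert d % num_heads == 0, "feature dim must split evenly across heads"
    dh = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    # project inputs, then split the feature dim into (heads, head_dim)
    q = (x @ Wq).reshape(t, num_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(t, num_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(t, num_heads, dh).transpose(1, 0, 2)
    # scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)    # (heads, t, t)
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    out = (attn @ v).transpose(1, 0, 2).reshape(t, d)  # merge heads back
    return out @ Wo, attn

# toy "noisy spectrogram": 10 frames x 32 features (made-up sizes)
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 32))
enhanced, attn = multihead_self_attention(frames, num_heads=4, rng=rng)
```

In a full enhancement model this block would sit between the single-type feature extractor (MLP, CNN, or GRU layers) and the output layer that predicts a clean spectrogram or mask; the output shape matches the input, so it drops in without reshaping.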
Pages: 20561-20576 (16 pages)