eSTImate: A Real-time Speech Transmission Index Estimator With Speech Enhancement Auxiliary Task Using Self-Attention Feature Pyramid Network

Cited by: 1
Authors
Xiang, Bajian [1]
Liu, Hongkun [1]
Wu, Zedong [1]
Shen, Su [1]
Zhang, Xiangdong [1]
Affiliation
[1] Youson Technol, Beijing, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
speech transmission index estimation; speech enhancement; deep neural networks; auxiliary learning; QUALITY ASSESSMENT; INTELLIGIBILITY;
DOI
10.21437/Interspeech.2023-727
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
The Speech Transmission Index (STI) is a crucial metric for evaluating speech intelligibility, but its standard measurement method is too complicated for real-time applications. Although recently proposed deep-learning-based STI estimation schemes can effectively address this problem, existing methods still fall short of covering all possible STI scenarios. This paper presents eSTImate, an end-to-end deep learning system for real-time blind STI estimation that integrates the tasks of STI estimation and speech enhancement through a feature pyramid auxiliary learning architecture and incorporates multi-head attention mechanisms. The proposed model achieves state-of-the-art performance, with a low mean absolute error of 0.016 and a root mean square error of 0.021 on a constructed dataset that covers the whole range of STI, highlighting its potential to provide accurate and consistent real-time STI estimation across diverse real-world scenarios.
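The two error figures quoted in the abstract are the mean absolute error (MAE) and the root mean square error (RMSE) between reference and estimated STI values. As a minimal sketch of how these two metrics are conventionally defined, the following uses hypothetical function names and made-up sample values; none of it is taken from the paper, and the paper's own evaluation code may differ.

```python
import math


def mean_absolute_error(reference, estimate):
    """Average of |reference - estimate| over all paired samples."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


def root_mean_square_error(reference, estimate):
    """Square root of the average squared difference over all paired samples."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return math.sqrt(squared / len(reference))


# Hypothetical STI values in [0, 1], for illustration only (not from the paper).
reference_sti = [0.30, 0.55, 0.80]
estimated_sti = [0.32, 0.54, 0.77]

mae = mean_absolute_error(reference_sti, estimated_sti)
rmse = root_mean_square_error(reference_sti, estimated_sti)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

RMSE weights large deviations more heavily than MAE, so reporting both gives a rough sense of whether the estimation errors are uniformly small or dominated by a few outliers.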
Pages: 2848-2852
Number of pages: 5