MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment

Cited by: 3
Authors
El Hajal, Karl [1, 2]
Cernak, Milos [1]
Mainar, Pablo [1]
Affiliations
[1] Logitech Europe SA, Lausanne, Switzerland
[2] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
Source
INTERSPEECH 2022 | 2022
Keywords
Speech quality assessment; joint learning; room acoustics; BAND;
DOI
10.21437/Interspeech.2022-10698
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
The acoustic environment can degrade speech quality during communication (e.g., video call, remote presentation, outside voice recording), and its impact is often unknown. Objective metrics for speech quality have proven challenging to develop given the multi-dimensionality of factors that affect speech quality and the difficulty of collecting labeled data. Hypothesizing the impact of acoustics on speech quality, this paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric that can predict room acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean opinion score (MOS) for speech quality. By explicitly optimizing the model to learn these room acoustics parameters, we can extract more informative features and improve the generalization for the MOS task when the training data is limited. Furthermore, we also show that this joint training method enhances the blind estimation of room acoustics, improving the performance of current state-of-the-art models. An additional side-effect of this joint prediction is the improvement in the explainability of the predictions, which is a valuable feature for many applications.
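The joint training idea in the abstract can be illustrated with a minimal sketch of a multi-task objective. This is not the authors' implementation: the mean squared error terms, the `acoustics_weight` parameter, and the function names `mse` and `joint_loss` are all assumptions made for illustration. Only the target names (MOS, SNR, STI, T60, DRR, C50) come from the abstract.

```python
from typing import Dict, List

# Room acoustics targets named in the abstract.
ROOM_PARAMS = ("SNR", "STI", "T60", "DRR", "C50")


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error over a batch of scalar predictions."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred: Dict[str, List[float]],
    target: Dict[str, List[float]],
    acoustics_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """MOS loss plus a weighted average of the room acoustics losses.

    Assumed form: L = MSE(MOS) + w * mean_k MSE(param_k).
    """
    mos_term = mse(pred["MOS"], target["MOS"])
    acoustics_term = sum(mse(pred[k], target[k]) for k in ROOM_PARAMS) / len(ROOM_PARAMS)
    return mos_term + acoustics_weight * acoustics_term
```

In a sketch like this, the room acoustics targets would likely need to be normalized to comparable ranges first, since quantities such as SNR (in dB) and STI (between 0 and 1) sit on very different scales.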
Pages: 3313-3317
Page count: 5