Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network

Cited by: 2

Authors:

Shu, Xiaofeng [1]

Chen, Yanjie [1]

Shang, Chuxiang [1]

Zhao, Yan [1]

Zhao, Chengshuai [1]

Zhu, Yehang [1]

Huang, Chuanzeng [1]

Wang, Yuxuan [1]

Affiliations:

[1] ByteDance, Speech, Audio & Music Intelligence (SAMI) Group, Beijing, People's Republic of China

Source:

INTERSPEECH 2022 | 2022

Keywords:

Speech quality assessment; MOS; VID; multi-task learning; SAA-TCN; PCC;

DOI:

10.21437/Interspeech.2022-10315

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

In subjective evaluations, speech quality is generally described by a mean opinion score (MOS). In recent years, non-intrusive speech quality assessment has made active progress by leveraging deep learning techniques. In this paper, we propose a new multi-task learning based model, termed the subband adaptive attention temporal convolutional neural network (SAA-TCN), to perform non-intrusive speech quality assessment with the help of a MOS value interval detector (VID) auxiliary task. Instead of using the fullband magnitude spectrogram, the proposed model takes the subband magnitude spectrogram as input to reduce model parameters and prevent overfitting. To effectively utilize the energy distribution information along the subband frequency dimension, subband adaptive attention (SAA) is employed to enhance the TCN model. Experimental results show that the proposed method achieves superior performance in predicting MOS values. In the Conferencing-Speech 2022 Challenge, our method achieves a mean Pearson's correlation coefficient (PCC) score of 0.763 and outperforms the challenge baseline method by 0.233.
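
For readers unfamiliar with the reported metric, the following is a minimal sketch of how Pearson's correlation coefficient between predicted and subjective MOS values can be computed. It is not the authors' evaluation code; the function name `pearson_cc` and the sample MOS values are hypothetical and chosen only for illustration. Taken at face value, the figures in the abstract (a PCC of 0.763 and a margin of 0.233) imply a baseline PCC of about 0.530.

```python
import math


def pearson_cc(predicted, reference):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical MOS values for illustration only (not data from the paper).
predicted_mos = [1.8, 2.9, 3.1, 4.2, 4.6]
subjective_mos = [2.0, 2.5, 3.4, 4.0, 4.8]
pcc = pearson_cc(predicted_mos, subjective_mos)
print(f"PCC = {pcc:.3f}")
```

A PCC close to 1 means the predicted scores rise and fall with the subjective scores; it does not by itself show that the predictions match the subjective scale.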


Pages: 3298-3302

Page count: 5
