Towards real-world objective speech quality and intelligibility assessment using speech-enhancement residuals and convolutional long short-term memory networks

Cited by: 5

Authors:

Dong, Xuan [1]

Williamson, Donald S. [1]

Affiliations:

[1] Indiana Univ, Dept Comp Sci, Bloomington, IN 47408 USA

Source:

Journal of the Acoustical Society of America | 2020 / Vol. 148 / No. 5

Keywords:

NONINTRUSIVE QUALITY; NOISE; PREDICTION; FEATURES; MASKING;

DOI:

10.1121/10.0002702

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Objective metrics, such as the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR), are often used for evaluating speech. These metrics are intrusive since they require a reference (clean) speech signal to complete the evaluation. The need for a reference signal reduces the practicality of these metrics, since a clean reference signal is not typically available during real-world testing. In this paper, a two-stage approach is presented that estimates the objective score of these intrusive metrics in a non-intrusive manner, which enables testing in real-world environments. More specifically, objective score estimation is treated as a machine-learning problem, and the use of speech-enhancement residuals and convolutional long short-term memory (SER-CL) networks is proposed to blindly estimate the objective scores (i.e., PESQ, STOI, and SDR) of various speech signals. The approach is evaluated in simulated and real environments that contain different combinations of noise and reverberation. The results reveal that the proposed approach is a reasonable alternative for evaluating speech, where it performs well in terms of accuracy and correlation. The proposed approach also outperforms comparison approaches in several environments.


Pages: 3348-3359

Page count: 12
