Training Data-Driven Speech Intelligibility Predictors on Heterogeneous Listening Test Data

Cited by: 3

Authors:

Pedersen, Mathias Bach [1]

Andersen, Asger H. [2, 3]

Jensen, Soren Holdt [4]

Tan, Zheng-Hua [1]

Jensen, Jesper [1, 2]

Affiliations:

[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark

[2] Demant AS, DK-2765 Smorum, Denmark

[3] WS Audiol AS, DK-3540 Lynge, Denmark

[4] Danish Minist Def Estate Agcy, DK-9800 Hjorring, Denmark

Source:

IEEE Access | 2022 / Vol. 10

Keywords:

Training; Indexes; Training data; Predictive models; Speech processing; Convolutional neural networks; Hidden Markov models; Neural networks; psychometric functions; speech intelligibility prediction; NOISE; ENHANCEMENT; INFORMATION; PERCEPTION; QUALITY;

DOI:

10.1109/ACCESS.2022.3184785

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Prediction of speech intelligibility (SI) is of interest for most speech processing applications in which intelligibility matters, e.g., speech coding, transmission, and enhancement. Traditionally, SI predictors have been based on signal processing methods and heuristics, but more recently an increasing number of data-driven SI predictors have been proposed. Data-driven prediction of SI requires large quantities of labelled data, ideally from many listening tests. Listening tests differ in factors such as vocabulary, talker, and the listener's task, collectively referred to as the paradigm. A naive strategy of training SI predictors directly on stimuli pooled from different listening tests is futile, because the exact map from the stimulus to SI is determined not only by the stimulus but also by the paradigm. Data-driven SI predictors trained in this way become specialized to the paradigms of the training data by erroneously attributing all paradigm influences on SI to the stimulus. The problem is fundamental and persists even in the idealized situation where training data is abundant. We propose a strategy for training data-driven SI predictors that is independent of the paradigms underlying the training data. The proposed strategy is to concatenate an SI predictor and a layer of trainable dataset-specific mapping functions, each corresponding to a single paradigm in the training data. These mapping functions are trained jointly with the SI predictor and serve to efficiently approximate the psychometric functions implied by each paradigm. The mapping functions prevent the predictor from specializing to these paradigms during training. We present an SI predictor with a novel architecture that incorporates a convolutional network and an ESTOI back-end, train it with this strategy, and compare it to naive training and a range of existing non-data-driven predictors. The proposed training strategy and architecture result in higher performance overall and increased robustness to unseen paradigms.
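
The idea of a layer of dataset-specific mapping functions can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the logistic form of the mapping, the parameter names `slope` and `midpoint`, the dataset labels `test_A` and `test_B`, and the toy index/SI pairs. To keep the sketch short, the paradigm-independent index that an SI predictor would output is held fixed and only the mapping layer is trained, whereas the abstract describes training both jointly.

```python
import math


def logistic(x):
    """Standard logistic function, used here as an assumed psychometric shape."""
    return 1.0 / (1.0 + math.exp(-x))


class DatasetMappingLayer:
    """Hypothetical layer: one trainable (slope, midpoint) pair per listening test."""

    def __init__(self, dataset_ids):
        # Every dataset starts from the same mapping; training makes them differ.
        self.params = {d: [1.0, 0.0] for d in dataset_ids}  # [slope, midpoint]

    def forward(self, index, dataset_id):
        """Map a paradigm-independent index to predicted SI for one dataset."""
        slope, midpoint = self.params[dataset_id]
        return logistic(slope * (index - midpoint))

    def sgd_step(self, index, dataset_id, measured_si, lr=0.1):
        """One gradient step on squared error for this dataset's parameters only."""
        slope, midpoint = self.params[dataset_id]
        pred = logistic(slope * (index - midpoint))
        # Derivative of (pred - y)^2 with respect to z = slope * (index - midpoint).
        dz = 2.0 * (pred - measured_si) * pred * (1.0 - pred)
        self.params[dataset_id] = [
            slope - lr * dz * (index - midpoint),  # dz/dslope = index - midpoint
            midpoint - lr * dz * (-slope),         # dz/dmidpoint = -slope
        ]
        return (pred - measured_si) ** 2


def total_loss(layer, data):
    """Sum of squared errors over all datasets."""
    return sum(
        (layer.forward(idx, d) - si) ** 2
        for d, pairs in data.items()
        for idx, si in pairs
    )


# Toy data: identical index values, but different measured SI per paradigm.
data = {
    "test_A": [(-1.0, 0.50), (0.0, 0.80), (1.0, 0.95)],
    "test_B": [(-1.0, 0.10), (0.0, 0.30), (1.0, 0.60)],
}

layer = DatasetMappingLayer(data.keys())
loss_before = total_loss(layer, data)
for _ in range(2000):
    for d, pairs in data.items():
        for idx, si in pairs:
            layer.sgd_step(idx, d, si)
loss_after = total_loss(layer, data)

print(loss_before, loss_after)
print(layer.forward(0.0, "test_A"), layer.forward(0.0, "test_B"))
```

In this toy setting the same index value ends up mapped to a different SI for each dataset, so the paradigm-dependent part of the stimulus-to-SI map is absorbed by the per-dataset parameters rather than by the shared index.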


Pages: 66175-66189

Number of pages: 15

