Training Data-Driven Speech Intelligibility Predictors on Heterogeneous Listening Test Data

Cited by: 3
Authors
Pedersen, Mathias Bach [1]
Andersen, Asger H. [2, 3]
Jensen, Soren Holdt [4]
Tan, Zheng-Hua [1]
Jensen, Jesper [1, 2]
Affiliations
[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark
[2] Demant AS, DK-2765 Smorum, Denmark
[3] WS Audiol AS, DK-3540 Lynge, Denmark
[4] Danish Minist Def Estate Agcy, DK-9800 Hjorring, Denmark
Keywords
Training; Indexes; Training data; Predictive models; Speech processing; Convolutional neural networks; Hidden Markov models; Neural networks; psychometric functions; speech intelligibility prediction; NOISE; ENHANCEMENT; INFORMATION; PERCEPTION; QUALITY
DOI
10.1109/ACCESS.2022.3184785
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Prediction of speech intelligibility (SI) is of interest for most speech processing applications in which intelligibility matters, e.g., speech coding, transmission, and enhancement. Traditionally, SI predictors have been based on signal processing methods and heuristics, but more recently an increasing number of data-driven SI predictors have been proposed. Data-driven prediction of SI requires large quantities of labelled data, ideally from many listening tests. Listening tests differ in factors such as vocabulary, talker, and listener's task, collectively referred to as the paradigm. A naive strategy of training SI predictors directly on stimuli pooled from different listening tests is futile, because the exact map from stimulus to SI is determined not only by the stimulus but also by the paradigm. Data-driven SI predictors trained in this way become specialized to the paradigms of the training data by erroneously attributing all paradigm influences on SI to the stimulus. The problem is fundamental and persists even in the idealized situation where training data is abundant. We propose a strategy for training data-driven SI predictors that is independent of the paradigms underlying the training data. The proposed strategy is to concatenate an SI predictor and a layer of trainable dataset-specific mapping functions, each corresponding to a single paradigm in the training data. These mapping functions are trained jointly with the SI predictor and serve to efficiently approximate the psychometric functions implied by each paradigm. The mapping functions prevent the predictor from specializing to these paradigms during training. We present an SI predictor with a novel architecture that incorporates a convolutional network and an ESTOI back-end, train it with this strategy, and compare it to naive training and to a range of existing non-data-driven predictors. The proposed training strategy and architecture result in higher performance overall and increased robustness to unseen paradigms.
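The idea of dataset-specific mapping functions can be illustrated with a minimal sketch. It assumes, purely for illustration, that each mapping is a two-parameter logistic function; the abstract does not state the functional form the authors use. The names `logistic_mapping` and `predict_si` and all numeric values are hypothetical. The point is that the shared predictor outputs one paradigm-independent index, and only the per-dataset mapping turns that index into a paradigm-specific intelligibility score.

```python
import numpy as np

def logistic_mapping(index, slope, midpoint):
    """Map a paradigm-independent index to an intelligibility score in (0, 1).

    Assumption: a logistic curve stands in for the psychometric function.
    """
    return 1.0 / (1.0 + np.exp(-slope * (index - midpoint)))

def predict_si(index, dataset_ids, slopes, midpoints):
    """Apply to each stimulus the mapping that belongs to its own dataset."""
    return logistic_mapping(index, slopes[dataset_ids], midpoints[dataset_ids])

# Hypothetical parameters for two listening tests (two paradigms).
# In joint training these would be learned together with the predictor.
slopes = np.array([10.0, 4.0])
midpoints = np.array([0.3, 0.6])

# The shared predictor gives both stimuli the same index ...
index = np.array([0.5, 0.5])
dataset_ids = np.array([0, 1])

# ... but the predicted intelligibility differs because the paradigms differ.
si = predict_si(index, dataset_ids, slopes, midpoints)
print(si)
```

Because the paradigm-dependent part is confined to `slopes` and `midpoints`, the mapping layer can be discarded or replaced when the predictor is applied to an unseen paradigm, while the index itself stays unchanged.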
Pages: 66175-66189
Page count: 15