On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

Cited by: 13

Authors:

Gallardo-Antolin, Ascension [1]

Montero, Juan M. [2]

Affiliations:

[1] Univ Carlos III Madrid, Dept Signal Theory & Commun, Avda Univ 30, Madrid 28911, Spain

[2] Univ Politecn Madrid, Speech Technol Grp, ETSIT, Avda Complutense 30, Madrid 28040, Spain

Source:

NEUROCOMPUTING | 2021, Vol. 456

Keywords:

Speech intelligibility; LSTM; Attention; Acoustic spectrogram; Modulation spectrogram; Fusion; DYSARTHRIA;

DOI:

10.1016/j.neucom.2021.05.065

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions or physiological issues. In this work, we deal with the problem of automatically predicting the speech intelligibility level in the latter case. Starting from our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. First, we propose the use of per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. Second, we explore two different strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: at decision level (late fusion) and at utterance level (Weighted-Pooling (WP) fusion). The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional LSTM networks are able to adequately model the modulation spectrogram sequences, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for the task of speech intelligibility prediction that can be effectively exploited by the LSTM-based architectures. The system with the WP fusion strategy and Attention-Pooling achieves the best results. (c) 2021 Elsevier B.V. All rights reserved.
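
The two combination strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the LSTM outputs are replaced by random per-frame vectors, the scoring vectors `w_logmel` and `w_mod` are hypothetical stand-ins for learned attention parameters, the utterance-level fusion is assumed to be a concatenation of the attention-pooled vectors of each stream, and late fusion is assumed to be a weighted average of class posteriors.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, w):
    # frames: (T, D) per-frame vectors; w: (D,) hypothetical scoring vector.
    # Returns a (D,) utterance-level vector as an attention-weighted sum of frames.
    alpha = softmax(frames @ w)   # (T,) attention weights that sum to 1
    return alpha @ frames

def utterance_level_fusion(h_logmel, h_mod, w_logmel, w_mod):
    # Assumption: pool each stream separately, then concatenate the pooled vectors.
    return np.concatenate([attention_pool(h_logmel, w_logmel),
                           attention_pool(h_mod, w_mod)])

def late_fusion(p_logmel, p_mod, weight=0.5):
    # Assumption: decision-level fusion as a convex combination of class posteriors.
    return weight * p_logmel + (1.0 - weight) * p_mod

rng = np.random.default_rng(0)
h_logmel = rng.normal(size=(50, 8))   # stand-in for the log-mel stream (50 frames)
h_mod = rng.normal(size=(50, 6))      # stand-in for the modulation stream
fused_vec = utterance_level_fusion(h_logmel, h_mod,
                                   rng.normal(size=8), rng.normal(size=6))
fused_post = late_fusion(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2]))
print(fused_vec.shape, fused_post)
```

In the utterance-level variant a single classifier would see one joint vector per utterance, whereas in late fusion each stream is classified separately and only the posteriors are merged.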

Pages: 49-60

Page count: 12
