An efficient joint training model for monaural noisy-reverberant speech recognition

Cited by: 0
Authors
Lian, Xiaoyu [1 ]
Xia, Nan [1 ]
Dai, Gaole [1 ]
Yang, Hongqin [1 ]
Affiliations
[1] Dalian Polytech Univ, Sch Informat Sci & Engn, Dalian 116034, Liaoning, Peoples R China
Keywords
Deep learning; Speech enhancement; Speech recognition; Attention mechanism; Joint training; Networks; Enhancement; Framework; Signal
DOI
10.1016/j.apacoust.2024.110322
CLC classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Noise and reverberation can severely degrade speech quality and intelligibility, hurting the performance of downstream speech recognition tasks. This paper constructs a joint training network for speech recognition in monaural noisy-reverberant environments. In the speech enhancement model, complex-valued channel and temporal-frequency attention (CCTFA) is integrated to focus on the key features of the complex spectrum, and the CCTFA network (CCTFANet) is constructed to reduce the influence of noise and reverberation. In the speech recognition model, an element-wise linear attention (EWLA) module is proposed to linearize the attention complexity and reduce the number of parameters and computations required by the attention module, and the EWLA Conformer (EWLAC) is constructed as an efficient end-to-end speech recognition model. On an open-source dataset, joint training of CCTFANet with EWLAC reduces the CER by 3.27%. Compared with other speech recognition models, EWLAC maintains a comparable CER while requiring far fewer parameters and computations and achieving higher inference speed.
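The abstract does not give the exact EWLA formulation, but "linearizing the attention complexity" generally refers to the kernel-feature-map trick: replacing softmax(QKᵀ)V, which costs O(N²d) in sequence length N, with φ(Q)(φ(K)ᵀV), which costs O(Nd²). The sketch below illustrates that generic trick only; the function name, the choice of φ (elu + 1), and all shapes are assumptions for illustration, not the paper's actual module.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Generic kernelized linear attention (illustrative, not the paper's EWLA).

    Instead of softmax(Q @ K.T) @ V  -> O(N^2 * d),
    compute phi(Q) @ (phi(K).T @ V)  -> O(N * d^2),
    where phi(x) = elu(x) + 1 keeps all attention weights positive.
    Q, K: (N, d); V: (N, d_v); returns (N, d_v).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # (d, d_v) summary of keys/values, built once
    Z = Qp @ Kp.sum(axis=0)       # (N,) per-query normalizer (row weight sums)
    return (Qp @ KV) / (Z[:, None] + eps)
```

Because φ is positive and each row is normalized by its total weight, every output row is a convex combination of the rows of V, mirroring softmax attention's averaging behavior at linear cost in N.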
Pages: 13
Related papers
50 records in total
  • [21] SNRi Target Training for Joint Speech Enhancement and Recognition
    Koizumi, Yuma
    Karita, Shigeki
    Narayanan, Arun
    Panchapagesan, Sankaran
    Bacchiani, Michiel
    INTERSPEECH 2022, 2022, : 1173 - 1177
  • [22] Multi-Channel Training for End-to-End Speaker Recognition under Reverberant and Noisy Environment
    Cai, Danwei
    Qin, Xiaoyi
    Li, Ming
    INTERSPEECH 2019, 2019, : 4365 - 4369
  • [24] Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation
    Alam, Md Jahangir
    Gupta, Vishwa
    Kenny, Patrick
    Dumouchel, Pierre
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2015, : 1 - 13
  • [25] Robust front-end for speech recognition by human and machine in noisy reverberant environments: the effect of phase information
    Liu, Yang
    Nower, Naushin
    Morita, Shota
    Unoki, Masashi
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016
  • [26] A Global Discriminant Joint Training Framework for Robust Speech Recognition
    Li, Lujun
    Kuerzinger, Ludwig
    Watzel, Tobias
    Rigoll, Gerhard
    2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2021), 2021, : 544 - 551
  • [27] A Unified Recognition and Correction Model under Noisy and Accent Speech Conditions
    Yang, Zhao
    Ng, Dianwen
    Zhang, Chong
    Jiang, Rui
    Xi, Wei
    Ma, Yukun
    Ni, Chongjia
    Zhao, Jizhong
    Ma, Bin
    Chng, Eng Siong
    INTERSPEECH 2023, 2023, : 4953 - 4957
  • [28] Auditory model for robust speech recognition in real world noisy environments
    Kim, DS
    Lee, SY
    Kil, RM
    Zhu, XL
    ELECTRONICS LETTERS, 1997, 33 (01) : 12 - 13
  • [29] Joint Bottleneck Feature and Attention Model for Speech Recognition
    Long Xingyan
    Qu Dan
    PROCEEDINGS OF 2018 INTERNATIONAL CONFERENCE ON MATHEMATICS AND ARTIFICIAL INTELLIGENCE (ICMAI 2018), 2018, : 46 - 50
  • [30] Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition
    Wiesner, Matthew
    Sarma, Mousmita
    Arora, Ashish
    Raj, Desh
    Gao, Dongji
    Huang, Ruizhe
    Preet, Supreet
    Johnson, Moris
    Iqbal, Zikra
    Goel, Nagendra
    Trmal, Jan
    Garcia, Paola
    Khudanpur, Sanjeev
    INTERSPEECH 2021, 2021, : 2906 - 2910