Automatic Speech Recognition: A survey of deep learning techniques and approaches

Cited by: 0
Authors
Ahlawat, Harsh [1 ]
Aggarwal, Naveen [1 ]
Gupta, Deepti [1 ]
Affiliations
[1] University Institute of Engineering and Technology, Panjab University, Chandigarh
Source
International Journal of Cognitive Computing in Engineering | 2025 / Vol. 6
Keywords
Automatic Speech Recognition; Conformer; Datasets; Deep learning; Deep Neural Networks; Multilingual; Transformer
DOI
10.1016/j.ijcce.2024.12.007
Abstract
Significant research has been conducted over the last decade on the application of machine learning to speech processing, particularly speech recognition. In recent years, deep learning models have shown promising results for various speech-related applications, and with the emergence of end-to-end models, deep learning has revolutionized the field of Automatic Speech Recognition (ASR). A recent surge in transfer learning-based models and attention-based approaches trained on large datasets has given further impetus to ASR. This paper provides a thorough review of the numerous studies conducted since 2010 and an extensive comparison of the state-of-the-art methods currently used in this research area, with a special focus on deep learning models, along with an analysis of contemporary approaches for both monolingual and multilingual models. Deep learning approaches are data dependent, and their accuracy varies across datasets. We therefore also analyze various models on publicly accessible speech datasets to understand model performance across diverse datasets for practical deployment. Finally, this study highlights research findings, open challenges, and the way forward, which may serve as a starting point for academicians interested in open-source ASR research, particularly in mitigating data dependency and improving generalizability across low-resource languages, speaker variability, and noise conditions. © 2025 The Authors
Pages: 201-237
Page count: 36
Related papers
195 entries in total
[41]  
Cui J., Kingsbury B., Ramabhadran B., Saon G., Sercu T., Audhkhasi K., Et al., Knowledge distillation across ensembles of multilingual models for low-resource languages, 2017 IEEE international conference on acoustics, speech and signal processing, pp. 4825-4829, (2017)
[42]  
Cui J., Kingsbury B., Ramabhadran B., Sethy A., Audhkhasi K., Cui X., Et al., Multilingual representations for low resource speech recognition and keyword search, 2015 IEEE workshop on automatic speech recognition and understanding, pp. 259-266, (2015)
[43]  
Dahl G.E., Yu D., Deng L., Acero A., Large vocabulary continuous speech recognition with context-dependent DBN-HMMs, 2011 IEEE international conference on acoustics, speech and signal processing, pp. 4688-4691, (2011)
[44]  
Dash D., Kim M.J., Teplansky K., Wang J., (2018)
[45]  
Deng L., Hinton G., Kingsbury B., New types of deep neural network learning for speech recognition and related applications: An overview, 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8599-8603, (2013)
[46]  
Devlin J., Chang M.-W., Lee K., Toutanova K., BERT: Pre-training of deep bidirectional transformers for language understanding, (2018)
[47]  
Dhanjal A.S., Singh W., A comprehensive survey on automatic speech recognition using neural networks, Multimedia Tools and Applications, pp. 1-46, (2023)
[48]  
Dida H.A., Chakravarthy D., Rabbi F., ChatGPT and big data: Enhancing text-to-speech conversion, Mesopotamian Journal of Big Data, pp. 31-35, (2023)
[49]  
Diwan A., Vaideeswaran R., Shah S., Singh A., Raghavan S., Khare S., Et al., Multilingual and code-switching ASR challenges for low resource Indian languages, (2021)
[50]  
Dong L., Xu S., Xu B., Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition, 2018 IEEE international conference on acoustics, speech and signal processing, pp. 5884-5888, (2018)