A review of deep learning techniques for speech processing

Cited by: 84
Authors
Mehrish, Ambuj [1]
Majumder, Navonil [1]
Bharadwaj, Rishabh [1]
Mihalcea, Rada [2]
Poria, Soujanya [1]
Affiliations
[1] Singapore Univ Technol & Design, ISTD, Singapore, Singapore
[2] Univ Michigan, Ann Arbor, MI USA
Keywords
Deep learning; Speech processing; Transformers; Survey; Trends; TEXT-TO-SPEECH; CONVOLUTIONAL NEURAL-NETWORKS; UNSUPERVISED DOMAIN ADAPTATION; SPEAKER RECOGNITION; VOICE CONVERSION; WAVE-FORM; QUALITY PREDICTION; PLUS ALGORITHM; ENHANCEMENT; REPRESENTATION;
DOI
10.1016/j.inffus.2023.101869
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
Pages: 55
Related papers
50 in total
  • [1] Deep Learning Techniques for Speech Emotion Recognition: A Review
    Pandey, Sandeep Kumar
    Shekhawat, H. S.
    Prasanna, S. R. M.
    2019 29TH INTERNATIONAL CONFERENCE RADIOELEKTRONIKA (RADIOELEKTRONIKA), 2019 : 197 - 202
  • [2] Speech Emotion Recognition Using Deep Learning Techniques: A Review
    Khalil, Ruhul Amin
    Jones, Edward
    Babar, Mohammad Inayatullah
    Jan, Tariqullah
    Zafar, Mohammad Haseeb
    Alhussain, Thamer
    IEEE ACCESS, 2019, 7 : 117327 - 117345
  • [3] Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques
    Haeb-Umbach, Reinhold
    Watanabe, Shinji
    Nakatani, Tomohiro
    Bacchiani, Michiel
    Hoffmeister, Bjoern
    Seltzer, Michael L.
    Zen, Heiga
    Souden, Mehrez
    IEEE SIGNAL PROCESSING MAGAZINE, 2019, 36 (06) : 111 - 124
  • [4] Speech and language processing with deep learning for dementia diagnosis: A systematic review
    Shi, Mengke
    Cheung, Gary
    Shahamiri, Seyed Reza
    PSYCHIATRY RESEARCH, 2023, 329
  • [5] Semantic speech analysis using machine learning and deep learning techniques: a comprehensive review
    Tyagi, Suryakant
    Szenasi, Sandor
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (29) : 73427 - 73456
  • [6] Video Processing Using Deep Learning Techniques: A Systematic Literature Review
    Sharma, Vijeta
    Gupta, Manjari
    Kumar, Ajai
    Mishra, Deepti
    IEEE ACCESS, 2021, 9 : 139489 - 139507
  • [7] Speech Enhancement: Traditional and Deep Learning Techniques
    Gaddamedi, Satya Prasad
    Patel, Anuj
    Chandra, Sabyasachi
    Bharati, Puja
    Ghosh, Nirmalya
    Das Mandal, Shyamal Kumar
    PROCEEDINGS OF 27TH INTERNATIONAL SYMPOSIUM ON FRONTIERS OF RESEARCH IN SPEECH AND MUSIC, FRSM 2023, 2024, 1455 : 75 - 86
  • [8] Survey of Deep Learning Paradigms for Speech Processing
    Kishor Barasu Bhangale
    Mohanaprasad Kothandaraman
    Wireless Personal Communications, 2022, 125 : 1913 - 1949
  • [9] Survey of Deep Learning Paradigms for Speech Processing
    Bhangale, Kishor Barasu
    Kothandaraman, Mohanaprasad
    WIRELESS PERSONAL COMMUNICATIONS, 2022, 125 (02) : 1913 - 1949
  • [10] Deep learning techniques for biomedical data processing
    Bianchini, Monica
    Dimitri, Giovanna Maria
    INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS, 2023, 17 (01) : 211 - 228