A review of deep learning techniques for speech processing

Cited by: 84
Authors
Mehrish, Ambuj [1]
Majumder, Navonil [1]
Bharadwaj, Rishabh [1]
Mihalcea, Rada [2]
Poria, Soujanya [1]
Affiliations
[1] Singapore Univ Technol & Design, ISTD, Singapore, Singapore
[2] Univ Michigan, Ann Arbor, MI USA
Keywords
Deep learning; Speech processing; Transformers; Survey; Trends; TEXT-TO-SPEECH; CONVOLUTIONAL NEURAL-NETWORKS; UNSUPERVISED DOMAIN ADAPTATION; SPEAKER RECOGNITION; VOICE CONVERSION; WAVE-FORM; QUALITY PREDICTION; PLUS ALGORITHM; ENHANCEMENT; REPRESENTATION;
DOI
10.1016/j.inffus.2023.101869
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
Pages: 55