A review of deep learning techniques for speech processing

Cited by: 129

Authors:

Mehrish, Ambuj [1]

Majumder, Navonil [1]

Bharadwaj, Rishabh [1]

Mihalcea, Rada [2]

Poria, Soujanya [1]

Affiliations:

[1] Singapore Univ Technol & Design, ISTD, Singapore, Singapore

[2] Univ Michigan, Ann Arbor, MI USA

Source:

INFORMATION FUSION | 2023 / Vol. 99

Keywords:

Deep learning; Speech processing; Transformers; Survey; Trends; TEXT-TO-SPEECH; CONVOLUTIONAL NEURAL-NETWORKS; UNSUPERVISED DOMAIN ADAPTATION; SPEAKER RECOGNITION; VOICE CONVERSION; WAVE-FORM; QUALITY PREDICTION; PLUS ALGORITHM; ENHANCEMENT; REPRESENTATION;

DOI:

10.1016/j.inffus.2023.101869

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.

Pages: 55
