Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages

Cited by: 0
Authors
Gupta, Astha [1 ]
Kumar, Rakesh [1 ]
Kumar, Yogesh [2 ]
Affiliations
[1] Chandigarh Univ, Dept Comp Sci & Engn, Mohali, Punjab, India
[2] Indus Univ, Indus Inst Technol & Engn, Ahmadabad, Gujarat, India
Keywords
Automatic Speech Recognition; Spectrogram; Short-Term Fourier Transform; MFCC; ResNet10; Inception V3; VGG16; DenseNet201; EfficientNetB0
DOI
10.1007/s11042-023-16748-1
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Speech is a natural phenomenon and a significant mode of human communication that falls into two categories: human-to-human and human-to-machine. Human-to-human communication depends on the language the speaker uses. In contrast, human-to-machine communication is a technique in which machines recognize human speech and act accordingly, commonly termed Automatic Speech Recognition (ASR). Recognition of non-Indian languages is challenging due to pitch variations and other factors such as accent and pronunciation. This paper proposes a novel hybrid model based on DenseNet201 and EfficientNetB0 for speech recognition. Initially, 76,263 speech samples are taken from 11 non-Indian languages: Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Japanese, Russian, Spanish, and Persian. These speech samples are then pre-processed to remove noise. Next, the Spectrogram, Short-Term Fourier Transform (STFT), Spectral Rolloff-Bandwidth, Mel-frequency Cepstral Coefficients (MFCC), and Chroma features are used to extract features from each speech sample. Further, a comparative analysis of the proposed approach is presented against other Deep Learning (DL) models: ResNet10, Inception V3, VGG16, DenseNet201, and EfficientNetB0. Standard metrics, including Precision, Recall, F1-Score, Confusion Matrix, Accuracy, and Loss curves, are used to evaluate the performance of each model on speech samples from all the languages mentioned above. The experimental results show that the hybrid model outperforms all the other models, giving the highest recognition accuracy of 99.84% with a loss of 0.004%.
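The feature-extraction stage described in the abstract (STFT spectrogram followed by MFCCs) can be sketched in plain NumPy. This is an illustrative reconstruction only: the window size, hop length, sampling rate, and filter counts below are common defaults, not values taken from the paper, and the synthetic sine-plus-noise input stands in for real speech.

```python
import numpy as np

def stft(signal, n_fft=512, hop=256):
    """Short-Term Fourier Transform via a sliding Hann window."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    # Magnitude spectrogram: |FFT| of each windowed frame (positive freqs only).
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc(signal, sr=16000, n_mels=26, n_coeff=13):
    """MFCC: log mel-spectrogram followed by a type-II DCT."""
    spec = stft(signal) ** 2                       # power spectrogram
    log_mel = np.log(mel_filterbank(n_mels, 512, sr) @ spec + 1e-10)
    # DCT-II along the mel axis keeps the lowest n_coeff cepstral terms.
    n = log_mel.shape[0]
    basis = np.cos(np.pi * np.outer(np.arange(n_coeff),
                                    (2 * np.arange(n) + 1) / (2 * n)))
    return basis @ log_mel

sr = 16000
t = np.arange(sr) / sr                             # 1 s synthetic "speech" signal
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)
features = mfcc(x, sr)
print(features.shape)                              # (n_coeff, n_frames)
```

In the paper's pipeline such per-frame coefficient matrices (together with chroma and spectral rolloff-bandwidth features) would then be rendered as image-like inputs for the DenseNet201/EfficientNetB0 hybrid; that classification stage is not shown here.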
Pages: 30145-30166
Page count: 22