Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages

Cited by: 0
Authors
Gupta, Astha [1 ]
Kumar, Rakesh [1 ]
Kumar, Yogesh [2 ]
Affiliations
[1] Chandigarh Univ, Dept Comp Sci & Engn, Mohali, Punjab, India
[2] Indus Univ, Indus Inst Technol & Engn, Ahmadabad, Gujarat, India
Keywords
Automatic Speech Recognition; Spectrogram; Short-Term Fourier Transform; MFCC; ResNet10; Inception V3; VGG16; DenseNet201; EfficientNetB0
DOI
10.1007/s11042-023-16748-1
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Speech is a natural phenomenon and a significant mode of human communication that falls into two categories: human-to-human and human-to-machine. Human-to-human communication depends on the language the speaker uses. In contrast, human-to-machine communication is a technique in which machines recognize human speech and act accordingly, commonly termed Automatic Speech Recognition (ASR). Recognition of non-Indian languages is challenging due to pitch variations and other factors such as accent and pronunciation. This paper proposes a novel hybrid model based on DenseNet201 and EfficientNetB0 for speech recognition. Initially, 76,263 speech samples are taken from 11 non-Indian languages: Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Japanese, Russian, Spanish, and Persian. These speech samples are then pre-processed to remove noise. Next, the Spectrogram, Short-Term Fourier Transform (STFT), Spectral Rolloff-Bandwidth, Mel-frequency Cepstral Coefficients (MFCC), and Chroma features are used to extract features from each speech sample. Further, a comparative analysis of the proposed approach is presented against other Deep Learning (DL) models: ResNet10, Inception V3, VGG16, DenseNet201, and EfficientNetB0. Standard metrics, including Precision, Recall, F1-Score, Confusion Matrix, Accuracy, and Loss curves, are used to evaluate the performance of each model on speech samples from all the languages mentioned above. The experimental results show that the hybrid model outperforms all the other models, giving the highest recognition accuracy of 99.84% with a loss of 0.004%.
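The feature-extraction stage described in the abstract (STFT spectrogram followed by MFCCs) can be sketched in plain NumPy. This is an illustrative reconstruction only: the window size, hop length, sampling rate, and filter counts below are common defaults, not values taken from the paper, and the synthetic sine-plus-noise input stands in for real speech.

```python
import numpy as np

def stft(signal, n_fft=512, hop=256):
    """Short-Term Fourier Transform via a sliding Hann window."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    # Magnitude spectrogram: |FFT| of each windowed frame (positive freqs only).
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc(signal, sr=16000, n_mels=26, n_coeff=13):
    """MFCC: log mel-spectrogram followed by a type-II DCT."""
    spec = stft(signal) ** 2                       # power spectrogram
    log_mel = np.log(mel_filterbank(n_mels, 512, sr) @ spec + 1e-10)
    # DCT-II along the mel axis keeps the lowest n_coeff cepstral terms.
    n = log_mel.shape[0]
    basis = np.cos(np.pi * np.outer(np.arange(n_coeff),
                                    (2 * np.arange(n) + 1) / (2 * n)))
    return basis @ log_mel

sr = 16000
t = np.arange(sr) / sr                             # 1 s synthetic "speech" signal
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)
features = mfcc(x, sr)
print(features.shape)                              # (n_coeff, n_frames)
```

In the paper's pipeline such per-frame coefficient matrices (together with chroma and spectral rolloff-bandwidth features) would then be rendered as image-like inputs for the DenseNet201/EfficientNetB0 hybrid; that classification stage is not shown here.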
Pages: 30145-30166
Page count: 22