Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages

Cited: 0
Authors
Gupta, Astha [1 ]
Kumar, Rakesh [1 ]
Kumar, Yogesh [2 ]
Affiliations
[1] Chandigarh Univ, Dept Comp Sci & Engn, Mohali, Punjab, India
[2] Indus Univ, Indus Inst Technol & Engn, Ahmadabad, Gujarat, India
Keywords
Automatic Speech Recognition; Spectrogram; Short Term Fourier transform; MFCC; ResNet10; Inception V3; VGG16; DenseNet201; EfficientNetB0;
DOI
10.1007/s11042-023-16748-1
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Speech is a natural phenomenon and a significant mode of human communication, divided into two categories: human-to-human and human-to-machine. Human-to-human communication depends on the language the speaker uses. In contrast, human-to-machine communication is a technique in which machines recognize human speech and act accordingly, commonly termed Automatic Speech Recognition (ASR). Recognizing non-Indian languages is challenging due to pitch variation and other factors such as accent and pronunciation. This paper proposes a novel hybrid model for speech recognition based on DenseNet201 and EfficientNetB0. Initially, 76,263 speech samples are taken from 11 non-Indian languages: Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Japanese, Russian, Spanish, and Persian. Once collected, these speech samples are pre-processed to remove noise. Then, the Spectrogram, Short-Time Fourier Transform (STFT), Spectral Rolloff and Bandwidth, Mel-Frequency Cepstral Coefficients (MFCC), and Chroma features are used to extract features from the speech samples. Further, a comparative analysis of the proposed approach is presented against other Deep Learning (DL) models: ResNet10, Inception V3, VGG16, DenseNet201, and EfficientNetB0. Standard metrics such as Precision, Recall, F1-Score, Confusion Matrix, Accuracy, and Loss curves are used to evaluate each model on speech samples from all the languages mentioned above. The experimental results show that the hybrid model outperforms all the other models, achieving the highest recognition accuracy of 99.84% with a loss of 0.004%.
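The feature-extraction step described above begins with a spectrogram computed via the STFT. As an illustrative sketch only (not the authors' implementation), a magnitude spectrogram can be computed in pure NumPy; the frame length, hop size, and Hann window used here are assumptions, not values reported in the paper:

```python
import numpy as np

def stft_spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram via the Short-Time Fourier Transform.

    Splits the signal into overlapping frames, applies a Hann window,
    and takes the one-sided FFT magnitude of each frame.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft returns frame_len // 2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy usage: 1 s of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)
spec = stft_spectrogram(sig)          # shape: (frames, 257)
peak_bin = int(np.argmax(spec.mean(axis=0)))
print(spec.shape, peak_bin)
```

The peak frequency bin corresponds to 440 Hz (bin ≈ 440 / 16000 × 512 ≈ 14). In practice a library such as librosa would also supply the MFCC and Chroma features the paper lists, but the hybrid model itself consumes these 2-D feature maps as image-like inputs.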
Pages: 30145-30166
Page count: 22
Related Papers
50 records
  • [21] A closer look at reinforcement learning-based automatic speech recognition
    Yang, Fan
    Yang, Muqiao
    Li, Xiang
    Wu, Yuxuan
    Zhao, Zhiyuan
    Raj, Bhiksha
    Singh, Rita
    COMPUTER SPEECH AND LANGUAGE, 2024, 87
  • [22] Speech pattern based black-box model watermarking for automatic speech recognition
    Chen, Haozhe
    Zhang, Weiming
    Liu, Kunlin
    Chen, Kejiang
    Fang, Han
    Yu, Nenghai
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 3059 - 3063
  • [23] Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model
    Mishra, Swami
    Bhatnagar, Nehal
    Prakasam, P.
    Sureshkumar, T. R.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (13) : 37603 - 37620
  • [24] Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition
    Purushothaman, Anurenjan
    Sreeram, Anirudh
    Kumar, Rohit
    Ganapathy, Sriram
    INTERSPEECH 2020, 2020, : 1688 - 1692
  • [25] Model based feature enhancement for automatic speech recognition in reverberant environments
    Krueger, Alexander
    Haeb-Umbach, Reinhold
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1239 - 1242
  • [26] Graph-Based Semisupervised Learning for Acoustic Modeling in Automatic Speech Recognition
    Liu, Yuzong
    Kirchhoff, Katrin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (11) : 1946 - 1956
  • [27] Auditory Model Based Optimization of MFCCs Improves Automatic Speech Recognition Performance
    Chatterjee, Saikat
    Koniaris, Christos
    Kleijn, W. Bastiaan
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2943 - 2946
  • [28] Language dialect based speech emotion recognition through deep learning techniques
    Rajendran, Sukumar
    Mathivanan, Sandeep Kumar
    Jayagopal, Prabhu
    Venkatasen, Maheshwari
    Pandi, Thanapal
    Sorakaya Somanathan, Manivannan
    Thangaval, Muthamilselvan
    Mani, Prasanna
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 24 (03) : 625 - 635
  • [30] Automatic Speech Recognition in L2 Learning: A Review Based on PRISMA Methodology
    Farrus, Mireia
    LANGUAGES, 2023, 8 (04)