Improving speech command recognition through decision-level fusion of deep filtered speech cues

Cited by: 5
Authors
Mehra, Sunakshi [1 ]
Ranga, Virender [1 ]
Agarwal, Ritu [1 ]
Affiliations
[1] Delhi Technol Univ, Dept Informat Technol, Delhi, India
Keywords
Speech filtering techniques; Swin-tiny transformer; Feed-forward neural network (FNN); Speech command recognition; ENHANCEMENT;
DOI
10.1007/s11760-023-02845-z
Chinese Library Classification (CLC)
TM [Electrical technology]; TN [Electronic and communication technology];
Subject classification codes
0808; 0809
Abstract
Living beings communicate through speech, which can be analysed to identify words and sentences by recognizing the flow of spoken utterances. Background noise, however, always affects this process, and detection rates under noisy conditions remain unsatisfactory, motivating further research and remedies. To recover useful information from noisy speech, this work proposes speech command recognition based on a combination of median filtering and adaptive filtering, processing two parallel channels of filtered speech independently. The procedure involves five steps: first, the signal is enhanced by two parallel, independent speech enhancement models (median and adaptive filtering); second, 2D Mel spectrogram images are extracted from each enhanced signal; third, the Mel spectrogram images are passed to the Swin-tiny transformer, pretrained on the large-scale ImageNet dataset (about 14 million images, roughly 150 GB); fourth, the posterior probabilities produced by the Swin-tiny transformer are fed into the proposed 3-layered feed-forward network for classification over the 10 speech command categories; finally, decision-level fusion combines the outputs of the two parallel, independent feed-forward channels. Experiments use the Google Speech Commands dataset version 2. The proposed approach achieves a test accuracy of 99.85%, which compares favourably with other state-of-the-art methods.
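The abstract describes a five-step, two-channel pipeline. Below is a minimal sketch of how such a pipeline could be wired together in Python with scipy, torchaudio, and timm; the filter orders, the LMS variant of adaptive filtering, the Mel-spectrogram and resizing settings, the FNN layer widths, and the average-rule fusion are illustrative assumptions rather than the authors' reported configuration, and the feed-forward heads would still need to be trained on Speech Commands data.

```python
# Hypothetical sketch of the two-channel, decision-level-fusion pipeline.
# Parameter choices below are illustrative assumptions, not the paper's setup.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
import timm
from scipy.signal import medfilt

SAMPLE_RATE = 16000
NUM_COMMANDS = 10  # 10 speech command categories

def median_enhance(wave, kernel=5):
    """Channel 1: median filtering to suppress impulsive noise."""
    return medfilt(np.asarray(wave, dtype=np.float64), kernel_size=kernel)

def lms_adaptive_enhance(wave, order=16, mu=0.01):
    """Channel 2: a simple LMS adaptive filter (one of many adaptive schemes)."""
    wave = np.asarray(wave, dtype=np.float64)
    w = np.zeros(order)
    out = np.copy(wave)
    for n in range(order, len(wave)):
        x = wave[n - order:n][::-1]   # reference vector of past samples
        y = np.dot(w, x)              # filter output (noise estimate)
        e = wave[n] - y               # enhanced sample = error signal
        w += 2 * mu * e * x           # LMS weight update
        out[n] = e
    return out

mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE,
                                           n_fft=1024, n_mels=224)

def to_mel_image(wave):
    """2D Mel spectrogram resized to a 3x224x224 'image' for the Swin backbone."""
    spec = mel(torch.tensor(wave, dtype=torch.float32))      # n_mels x T
    spec = spec.unsqueeze(0).unsqueeze(0)                     # 1 x 1 x n_mels x T
    spec = F.interpolate(spec, size=(224, 224)).squeeze(0)    # 1 x 224 x 224
    return spec.repeat(3, 1, 1)                               # replicate to 3 channels

# ImageNet-pretrained Swin-tiny; its 1000-class posteriors serve as deep features.
swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True).eval()

class CommandFNN(nn.Module):
    """3-layer feed-forward head mapping Swin posteriors to 10 command classes."""
    def __init__(self, in_dim=1000, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_COMMANDS),
        )
    def forward(self, x):
        return self.net(x)

# One head per filtered channel; in practice both are trained on Speech Commands.
fnn_median, fnn_adaptive = CommandFNN(), CommandFNN()

def predict(wave):
    """Run both filtered channels and fuse their posteriors at decision level."""
    with torch.no_grad():
        probs = []
        for enhance, head in ((median_enhance, fnn_median),
                              (lms_adaptive_enhance, fnn_adaptive)):
            img = to_mel_image(enhance(wave)).unsqueeze(0)    # 1 x 3 x 224 x 224
            posterior = swin(img).softmax(dim=-1)             # ImageNet posteriors
            probs.append(head(posterior).softmax(dim=-1))     # command posteriors
        fused = (probs[0] + probs[1]) / 2                     # average-rule fusion
    return int(fused.argmax(dim=-1))
```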
Pages: 1365-1373
Number of pages: 9