End-to-End Multi-Modal Speech Recognition on an Air and Bone Conducted Speech Corpus

Cited by: 14
Authors
Wang, Mou [1 ]
Chen, Junqi [1 ]
Zhang, Xiao-Lei [1 ]
Rahardja, Susanto [1 ,2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Marine Sci & Technol, Xian 710072, Peoples R China
[2] Singapore Inst Technol, Singapore 138683, Singapore
Funding
U.S. National Science Foundation;
Keywords
Speech recognition; multi-modal speech processing; bone conduction; air- and bone-conducted speech corpus; NOISE; ENHANCEMENT; NETWORK;
DOI
10.1109/TASLP.2022.3224305
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Automatic speech recognition (ASR) has improved significantly in recent years. However, most robust ASR systems are based on air-conducted (AC) speech, and their performance in low signal-to-noise-ratio (SNR) conditions is not satisfactory. Bone-conducted (BC) speech is intrinsically insensitive to environmental noise and can therefore serve as an auxiliary source for improving ASR performance at low SNRs. In this paper, we first develop a multi-modal Mandarin corpus of air- and bone-conducted synchronized speech (ABCS). The multi-modal speech is recorded with a headset equipped with both AC and BC microphones. To our knowledge, it is by far the largest corpus available for bone-conduction ASR research. We then propose a multi-modal conformer ASR system built on a novel multi-modal transducer (MMT). The proposed system extracts semantic embeddings from the AC and BC speech signals with a conformer-based encoder and a transformer-based truncated decoder. The semantic embeddings of the two speech sources are then fused dynamically with adaptive weights by the MMT module. Experimental results demonstrate that the proposed multi-modal system outperforms single-modal systems using either the AC or BC modality, as well as a multi-modal baseline system, by a large margin at various SNR levels. The results also show that the two modalities complement each other and that our method effectively exploits this complementary information.
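The adaptive-weight fusion described in the abstract can be pictured with a short, self-contained sketch. The code below is a hypothetical illustration, not the paper's actual MMT module: the class name AdaptiveFusion, the gating network, and the convex-combination rule are all assumptions; the paper fuses semantic embeddings inside a transducer, and its weighting scheme may differ.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    # Illustrative gated fusion of AC and BC embeddings (an assumption;
    # the paper's MMT module is not reproduced here).
    def __init__(self, dim: int):
        super().__init__()
        # Per-frame scalar gate computed from both embeddings.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.Tanh(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, emb_ac: torch.Tensor, emb_bc: torch.Tensor) -> torch.Tensor:
        # emb_ac, emb_bc: (batch, time, dim) embeddings from the AC and BC
        # encoder branches, assumed to be time-aligned.
        w = self.gate(torch.cat([emb_ac, emb_bc], dim=-1))  # (batch, time, 1)
        # Convex combination: w can lean toward AC in clean conditions and
        # toward the noise-robust BC stream when the AC channel is degraded.
        return w * emb_ac + (1.0 - w) * emb_bc

# Usage: fuse two 256-dimensional embedding streams of 100 frames.
fusion = AdaptiveFusion(dim=256)
fused = fusion(torch.randn(4, 100, 256), torch.randn(4, 100, 256))
print(fused.shape)  # torch.Size([4, 100, 256])

Learning the weight per frame, rather than fixing it globally, is what would let such a system exploit the complementarity the abstract reports: frames where the AC signal is buried in noise can draw mostly on the BC stream.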
Pages: 513-524
Page count: 12
Related Papers
50 records in total
  • [31] End-to-End Speech Recognition of Tamil Language
    Changrampadi, Mohamed Hashim
    Shahina, A.
    Narayanan, M. Badri
    Khan, A. Nayeemulla
     INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 32 (02): 1309 - 1323
  • [32] PARAMETER UNCERTAINTY FOR END-TO-END SPEECH RECOGNITION
    Braun, Stefan
    Liu, Shih-Chii
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5636 - 5640
  • [33] END-TO-END VISUAL SPEECH RECOGNITION WITH LSTMS
    Petridis, Stavros
    Li, Zuwei
    Pantic, Maja
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 2592 - 2596
  • [34] An End-to-End model for Vietnamese speech recognition
    Van Huy Nguyen
    2019 IEEE - RIVF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES (RIVF), 2019, : 307 - 312
  • [35] Review of End-to-End Streaming Speech Recognition
    Wang, Aohui
    Zhang, Long
    Song, Wenyu
    Meng, Jie
     COMPUTER ENGINEERING AND APPLICATIONS, 2024, 59 (02): 22 - 33
  • [36] End-to-End Speech Recognition and Disfluency Removal
    Lou, Paria Jamshid
    Johnson, Mark
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 2051 - 2061
  • [37] Performance Monitoring for End-to-End Speech Recognition
    Li, Ruizhi
    Sell, Gregory
    Hermansky, Hynek
    INTERSPEECH 2019, 2019, : 2245 - 2249
  • [38] TOWARDS END-TO-END UNSUPERVISED SPEECH RECOGNITION
    Liu, Alexander H.
    Hsu, Wei-Ning
    Auli, Michael
    Baevski, Alexei
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 221 - 228
  • [39] TRIGGERED ATTENTION FOR END-TO-END SPEECH RECOGNITION
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5666 - 5670
  • [40] An Overview of End-to-End Automatic Speech Recognition
    Wang, Dong
    Wang, Xiaodong
    Lv, Shaohe
     SYMMETRY-BASEL, 2019, 11 (08)