End-to-End Multi-Modal Speech Recognition on an Air and Bone Conducted Speech Corpus

Cited by: 14
Authors
Wang, Mou [1 ]
Chen, Junqi [1 ]
Zhang, Xiao-Lei [1 ]
Rahardja, Susanto [1 ,2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Marine Sci & Technol, Xian 710072, Peoples R China
[2] Singapore Inst Technol, Singapore 138683, Singapore
Funding
US National Science Foundation;
Keywords
Speech recognition; multi-modal speech processing; bone conduction; air- and bone-conducted speech corpus; NOISE; ENHANCEMENT; NETWORK;
DOI
10.1109/TASLP.2022.3224305
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Automatic speech recognition (ASR) has improved significantly in recent years. However, most robust ASR systems are based on air-conducted (AC) speech, and their performance in low signal-to-noise-ratio (SNR) conditions is not satisfactory. Bone-conducted (BC) speech is intrinsically insensitive to environmental noise and can therefore be used as an auxiliary source to improve ASR performance at low SNRs. In this paper, we first develop a multi-modal Mandarin corpus containing air- and bone-conducted synchronized speech (ABCS). The multi-modal speech is recorded with a headset equipped with both AC and BC microphones. To our knowledge, it is by far the largest corpus for bone-conduction ASR research. We then propose a multi-modal conformer ASR system based on a novel multi-modal transducer (MMT). The proposed system extracts semantic embeddings from the AC and BC speech signals with a conformer-based encoder and a transformer-based truncated decoder. The semantic embeddings of the two speech sources are fused dynamically with adaptive weights by the MMT module. Experimental results demonstrate that the proposed multi-modal system outperforms both single-modal systems (AC or BC only) and a multi-modal baseline system by a large margin at various SNR levels. The results also show that the two modalities complement each other, and that our method effectively exploits the complementary information of the different sources.
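The adaptive-weight fusion described in the abstract (frame-level semantic embeddings from the AC and BC branches combined by the MMT module) can be sketched roughly as below. This is a minimal illustration in PyTorch under assumed choices: the gating network, embedding dimension, and the per-frame weighted-sum fusion rule are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of adaptive fusion of AC and BC encoder embeddings,
# loosely following the abstract's description of the MMT module.
# The gating network and fusion rule below are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Fuse air-conducted (AC) and bone-conducted (BC) embeddings with
    frame-wise weights predicted from both modalities (assumed design)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Small gating network: concatenated embeddings -> 2 modality weights.
        self.gate = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 2),
        )

    def forward(self, ac_emb: torch.Tensor, bc_emb: torch.Tensor) -> torch.Tensor:
        # ac_emb, bc_emb: (batch, time, embed_dim) encoder outputs.
        weights = torch.softmax(
            self.gate(torch.cat([ac_emb, bc_emb], dim=-1)), dim=-1
        )
        # Per-frame weighted sum of the two modalities.
        return weights[..., 0:1] * ac_emb + weights[..., 1:2] * bc_emb


if __name__ == "__main__":
    fusion = AdaptiveFusion(embed_dim=256)
    ac = torch.randn(4, 100, 256)   # toy AC encoder output
    bc = torch.randn(4, 100, 256)   # toy BC encoder output
    print(fusion(ac, bc).shape)     # torch.Size([4, 100, 256])

In such a scheme, the learned weights would shift toward the BC branch as the AC SNR drops, which is the intuition behind using BC speech as an auxiliary source.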
Pages: 513 - 524
Page count: 12
Related Papers
50 items in total
  • [1] END-TO-END MULTI-MODAL SPEECH RECOGNITION WITH AIR AND BONE CONDUCTED SPEECH
    Chen, Junqi
    Wang, Mou
    Zhang, Xiao-Lei
    Huang, Zhiyong
    Rahardja, Susanto
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6052 - 6056
  • [2] Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language
    Matsuura, Kohei
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2622 - 2628
  • [3] Multi-Stream End-to-End Speech Recognition
    Li, Ruizhi
    Wang, Xiaofei
    Mallidi, Sri Harish
    Watanabe, Shinji
    Hori, Takaaki
    Hermansky, Hynek
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 646 - 655
  • [4] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION
    Settle, Shane
    Le Roux, Jonathan
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4819 - 4823
  • [5] Time-Domain Multi-Modal Bone/Air Conducted Speech Enhancement
    Yu, Cheng
    Hung, Kuo-Hsuan
    Wang, Syu-Siang
    Tsao, Yu
    Hung, Jeih-weih
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 1035 - 1039
  • [6] Multi-modal speech enhancement with bone-conducted speech in time domain
    Wang, Mou
    Chen, Junqi
    Zhang, Xiaolei
    Huang, Zhiyong
    Rahardja, Susanto
    APPLIED ACOUSTICS, 2022, 200
  • [7] End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition
    Kim, Suyoun
    Lane, Ian
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3867 - 3871
  • [8] End-to-End Speech Recognition in Russian
    Markovnikov, Nikita
    Kipyatkova, Irina
    Lyakso, Elena
    SPEECH AND COMPUTER (SPECOM 2018), 2018, 11096 : 377 - 386
  • [9] END-TO-END MULTIMODAL SPEECH RECOGNITION
    Palaskar, Shruti
    Sanabria, Ramon
    Metze, Florian
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5774 - 5778
  • [10] Overview of end-to-end speech recognition
    Wang, Song
    Li, Guanyu
    2018 INTERNATIONAL SYMPOSIUM ON POWER ELECTRONICS AND CONTROL ENGINEERING (ISPECE 2018), 2019, 1187