End-to-End Multi-Modal Speech Recognition on an Air and Bone Conducted Speech Corpus

Cited: 14
Authors
Wang, Mou [1 ]
Chen, Junqi [1 ]
Zhang, Xiao-Lei [1 ]
Rahardja, Susanto [1 ,2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Marine Sci & Technol, Xian 710072, Peoples R China
[2] Singapore Inst Technol, Singapore 138683, Singapore
Funding
U.S. National Science Foundation
Keywords
Speech recognition; multi-modal speech processing; bone conduction; air- and bone-conducted speech corpus; NOISE; ENHANCEMENT; NETWORK;
DOI
10.1109/TASLP.2022.3224305
CLC Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Automatic speech recognition (ASR) has improved significantly in recent years. However, most robust ASR systems are based on air-conducted (AC) speech, and their performance in low signal-to-noise-ratio (SNR) conditions is not satisfactory. Bone-conducted (BC) speech is intrinsically insensitive to environmental noise and can therefore serve as an auxiliary source for improving ASR performance at low SNR. In this paper, we first develop a multi-modal Mandarin corpus that contains air- and bone-conducted synchronized speech (ABCS). The multi-modal speech is recorded with a headset equipped with both AC and BC microphones. To our knowledge, it is by far the largest corpus available for bone-conduction ASR research. We then propose a multi-modal conformer ASR system based on a novel multi-modal transducer (MMT). The proposed system extracts semantic embeddings from the AC and BC speech signals with a conformer-based encoder and a transformer-based truncated decoder. The semantic embeddings of the two speech sources are fused dynamically with adaptive weights by the MMT module. Experimental results demonstrate that the proposed multi-modal system outperforms single-modal systems with either AC or BC modality, as well as a multi-modal baseline system, by a large margin at various SNR levels. The results also show that the two modalities complement each other, and that our method effectively utilizes the complementary information of the different sources.
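The abstract describes fusing the semantic embeddings of the two modalities "dynamically with adaptive weights." The paper's actual MMT module is not reproduced here; the following is only a minimal NumPy sketch of one common form of adaptive-weight fusion, where per-frame modality weights are computed from the embeddings themselves and normalized to sum to one. The function name `fuse_embeddings` and the scoring matrix `w` are illustrative assumptions, not the authors' API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_embeddings(emb_ac, emb_bc, w):
    """Fuse air- and bone-conducted embeddings with adaptive weights.

    emb_ac, emb_bc: (T, D) frame-level semantic embeddings from the
        two encoders (AC and BC branches).
    w: (D, 2) hypothetical scoring matrix; in a trained system these
        weights would be learned jointly with the rest of the model.
    """
    # Score each modality per frame, then normalize so the two weights
    # sum to 1 (a simple stand-in for the paper's MMT fusion).
    scores = np.stack([emb_ac @ w[:, 0], emb_bc @ w[:, 1]], axis=-1)  # (T, 2)
    alpha = softmax(scores, axis=-1)                                   # (T, 2)
    # Convex combination of the two embedding streams per frame.
    return alpha[..., :1] * emb_ac + alpha[..., 1:] * emb_bc           # (T, D)

T, D = 5, 8
rng = np.random.default_rng(0)
fused = fuse_embeddings(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                        rng.normal(size=(D, 2)))
print(fused.shape)  # (5, 8)
```

Because the weights are convex and computed per frame, the fusion can lean toward the AC embedding in clean conditions and toward the noise-robust BC embedding at low SNR, which is the intuition behind dynamic fusion schemes of this kind.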
Pages: 513-524
Page count: 12
Related Papers
(50 records)
  • [21] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [22] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [23] SPEECH ENHANCEMENT USING END-TO-END SPEECH RECOGNITION OBJECTIVES
    Subramanian, Aswin Shanmugam
    Wang, Xiaofei
    Baskar, Murali Karthick
    Watanabe, Shinji
    Taniguchi, Toru
    Tran, Dung
    Fujita, Yuya
    2019 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2019, : 234 - 238
  • [24] DeepVANet: A Deep End-to-End Network for Multi-modal Emotion Recognition
    Zhang, Yuhao
    Hossain, Md Zakir
    Rahman, Shafin
    HUMAN-COMPUTER INTERACTION, INTERACT 2021, PT III, 2021, 12934 : 227 - 237
  • [25] Towards multilingual end-to-end speech recognition for air traffic control
    Lin, Yi
    Yang, Bo
    Guo, Dongyue
    Fan, Peng
    IET INTELLIGENT TRANSPORT SYSTEMS, 2021, 15 (09) : 1203 - 1214
  • [26] MIMO-SPEECH: END-TO-END MULTI-CHANNEL MULTI-SPEAKER SPEECH RECOGNITION
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 237 - 244
  • [27] End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets
    Gambhir, Pooja
    Dev, Amita
    Bansal, Poonam
    Sharma, Deepak Kumar
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (01)
  • [28] Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation
    Fukuda, Ryo
    Sudoh, Katsuhito
    Nakamura, Satoshi
    INTERSPEECH 2022, 2022, : 121 - 125
  • [29] SYNCHRONOUS TRANSFORMERS FOR END-TO-END SPEECH RECOGNITION
    Tian, Zhengkun
    Yi, Jiangyan
    Bai, Ye
    Tao, Jianhua
    Zhang, Shuai
    Wen, Zhengqi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7884 - 7888
  • [30] End-to-End Speech Recognition For Arabic Dialects
    Nasr, Seham
    Duwairi, Rehab
    Quwaider, Muhannad
    Arabian Journal for Science and Engineering, 2023, 48 : 10617 - 10633