Sequence-to-Sequence Multi-Modal Speech In-Painting

Citations: 0
Authors
Elyaderani, Mahsa Kadkhodaei [1 ]
Shirani, Shahram [1 ]
Affiliations
[1] McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada
Source
INTERSPEECH 2023
Keywords
speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks; AUDIO; INTERPOLATION
DOI
10.21437/Interspeech.2023-1848
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Speech in-painting is the task of regenerating missing audio content from reliable context information. Despite several recent studies on multi-modal audio in-painting, an effective fusion of visual and auditory information for speech in-painting is still needed. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder acts as a lip-reader on the facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.
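As a rough illustration of the architecture described in the abstract, the sketch below pairs an LSTM encoder over lip (visual) features with an LSTM decoder that receives the distorted spectrogram fused with the encoder outputs. This is a minimal PyTorch-style sketch based only on the abstract: the class name, feature dimensions, bidirectionality, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): an LSTM encoder-decoder that fuses
# visual features with a distorted audio spectrogram to in-paint missing frames.
# All dimensions and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class AVSpeechInpainter(nn.Module):
    def __init__(self, visual_dim=512, audio_bins=128, hidden=256):
        super().__init__()
        # Encoder: consumes per-frame visual (lip) features, acting as a lip-reader.
        self.encoder = nn.LSTM(visual_dim, hidden, batch_first=True, bidirectional=True)
        # Decoder: consumes the distorted spectrogram concatenated with encoder outputs.
        self.decoder = nn.LSTM(audio_bins + 2 * hidden, hidden, batch_first=True)
        # Projection back to spectrogram bins for the restored speech.
        self.proj = nn.Linear(hidden, audio_bins)

    def forward(self, visual_feats, distorted_spec):
        # visual_feats:   (batch, T, visual_dim) -- lip-region features per video frame
        # distorted_spec: (batch, T, audio_bins) -- spectrogram with masked gaps
        enc_out, _ = self.encoder(visual_feats)               # (batch, T, 2*hidden)
        fused = torch.cat([distorted_spec, enc_out], dim=-1)  # time-aligned fusion
        dec_out, _ = self.decoder(fused)                      # (batch, T, hidden)
        return self.proj(dec_out)                             # reconstructed spectrogram

# Toy usage with arbitrary shapes: visual and audio streams assumed frame-aligned.
model = AVSpeechInpainter()
video = torch.randn(2, 75, 512)   # e.g. 3 s of 25 fps visual features
audio = torch.randn(2, 75, 128)   # spectrogram frames aligned to the video frames
restored = model(video, audio)    # (2, 75, 128)
```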
Pages: 829-833
Page count: 5
Related Papers (showing 10 of 50)
  • [1] Synthesizing waveform sequence-to-sequence to augment training data for sequence-to-sequence speech recognition
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2021, 42 (06) : 333 - 343
  • [2] Multi-modal Sequence-to-sequence Model for Continuous Affect Prediction in the Wild Using Deep 3D Features
    Rasipuram, Sowmya
    Bhat, Junaid Hamid
    Maitra, Anutosh
    2020 15TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2020), 2020, : 611 - 614
  • [3] MULTI-DIALECT SPEECH RECOGNITION WITH A SINGLE SEQUENCE-TO-SEQUENCE MODEL
    Li, Bo
    Sainath, Tara N.
    Sim, Khe Chai
    Bacchiani, Michiel
    Weinstein, Eugene
    Nguyen, Patrick
    Chen, Zhifeng
    Wu, Yonghui
    Rao, Kanishka
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4749 - 4753
  • [4] DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning
    Gu, Xiaodong
    Zhang, Hongyu
    Zhang, Dongmei
    Kim, Sunghun
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3675 - 3681
  • [5] MULTIMODAL GROUNDING FOR SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION
    Caglayan, Ozan
    Sanabria, Ramon
    Palaskar, Shruti
    Barrault, Loic
    Metze, Florian
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 8648 - 8652
  • [6] Sequence-to-Sequence Models for Emphasis Speech Translation
    Quoc Truong Do
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (10) : 1873 - 1883
  • [7] A Comparison of Sequence-to-Sequence Models for Speech Recognition
    Prabhavalkar, Rohit
    Rao, Kanishka
    Sainath, Tara N.
    Li, Bo
    Johnson, Leif
    Jaitly, Navdeep
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 939 - 943
  • [8] Advancing sequence-to-sequence based speech recognition
    Tuske, Zoltan
    Audhkhasi, Kartik
    Saon, George
    INTERSPEECH 2019, 2019, : 3780 - 3784
  • [9] Direct speech-to-speech translation with a sequence-to-sequence model
    Jia, Ye
    Weiss, Ron J.
    Biadsy, Fadi
    Macherey, Wolfgang
    Johnson, Melvin
    Chen, Zhifeng
    Wu, Yonghui
    INTERSPEECH 2019, 2019, : 1123 - 1127
  • [10] From Speech to Facial Activity: Towards Cross-modal Sequence-to-Sequence Attention Networks
    Stappen, Lukas
    Karas, Vincent
    Cummins, Nicholas
    Ringeval, Fabien
    Scherer, Klaus
    Schuller, Bjorn
    2019 IEEE 21ST INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP 2019), 2019,