Sequence-to-Sequence Multi-Modal Speech In-Painting

Cited by: 0
Authors
Elyaderani, Mahsa Kadkhodaei [1 ]
Shirani, Shahram [1 ]
Affiliations
[1] McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada
Source
INTERSPEECH 2023
Keywords
speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks; audio; interpolation
DOI
10.21437/Interspeech.2023-1848
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Speech in-painting is the task of regenerating missing audio content from reliable contextual information. Despite various recent studies on the multi-modal perception of audio in-painting, there is still a need for an effective fusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder acts as a lip-reader for the facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.
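The encoder-decoder data flow described in the abstract can be sketched at the tensor-shape level. This is a minimal illustration only: a plain tanh RNN stands in for the paper's LSTM layers, and all dimensions, weights, and the zeroed-gap mask below are assumptions for demonstration, not the authors' actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper).
T = 50        # time steps (video frames aligned with spectrogram frames)
D_VIS = 128   # per-frame visual (lip) feature size
D_SPEC = 257  # spectrogram frequency bins
D_HID = 64    # hidden size shared by encoder and decoder

def rnn_pass(x, w_in, w_rec):
    """Simple tanh RNN standing in for the LSTM layers; returns all states."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(w_in @ x_t + w_rec @ h)
        states.append(h)
    return np.stack(states)  # (T, D_HID)

# Encoder: plays the role of a lip-reader over the visual stream.
w_in_e = rng.normal(size=(D_HID, D_VIS)) * 0.1
w_rec_e = rng.normal(size=(D_HID, D_HID)) * 0.1

# Decoder: consumes encoder states concatenated with the distorted spectrogram.
w_in_d = rng.normal(size=(D_HID, D_HID + D_SPEC)) * 0.1
w_rec_d = rng.normal(size=(D_HID, D_HID)) * 0.1
w_out = rng.normal(size=(D_SPEC, D_HID)) * 0.1

video = rng.normal(size=(T, D_VIS))    # facial-recording features
spec = rng.normal(size=(T, D_SPEC))    # clean spectrogram
distorted = spec.copy()
distorted[20:35] = 0.0                 # simulate a masked/missing gap

enc_states = rnn_pass(video, w_in_e, w_rec_e)             # (T, D_HID)
dec_in = np.concatenate([enc_states, distorted], axis=1)  # (T, D_HID + D_SPEC)
dec_states = rnn_pass(dec_in, w_in_d, w_rec_d)            # (T, D_HID)
restored = dec_states @ w_out.T                           # (T, D_SPEC)

print(restored.shape)  # (50, 257)
```

The sketch only shows how the two modalities meet: the visual stream is encoded first, and the decoder conditions on both the encoder states and the distorted spectrogram to emit a full-resolution spectrogram. Training objectives, attention, and the actual LSTM cells are omitted.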
Pages: 829-833
Number of pages: 5
Related papers
50 records in total
  • [31] Multi-modal Sequence to Sequence Learning with Content Attention for Hotspot Traffic Speed Prediction
    Liao, Binbing
    Tang, Siliang
    Yang, Shengwen
    Zhu, Wenwu
    Wu, Fei
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 212 - 222
  • [32] LEVERAGING SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR ENHANCING ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Mimura, Masato
    Ueno, Sei
    Inaguma, Hirofumi
    Sakai, Shinsuke
    Kawahara, Tatsuya
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 477 - 484
  • [33] Multi-modal Neural Networks for symbolic sequence pattern classification
    Zhu, HX
    Yoshihara, I
    Yamamori, K
    Yasunaga, M
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2004, E87D (07) : 1943 - 1952
  • [34] Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis
    Fernandez, Raul
    Haws, David
    Lorberbom, Guy
    Shechtman, Slava
    Sorin, Alexander
    INTERSPEECH 2022, 2022, : 5488 - 5492
  • [35] SEQUENCE-TO-SEQUENCE MODELLING OF F0 FOR SPEECH EMOTION CONVERSION
    Robinson, Carl
    Obin, Nicolas
    Roebel, Axel
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6830 - 6834
  • [36] Multi-modal Expression Recognition in the Wild Using Sequence Modeling
    Rasipuram, Sowmya
    Bhat, Junaid Hamid
    Maitra, Anutosh
    2020 15TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2020), 2020, : 629 - 631
  • [37] Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions
    Hannun, Awni
    Lee, Ann
    Xu, Qiantong
    Collobert, Ronan
    INTERSPEECH 2019, 2019, : 3785 - 3789
  • [38] ON SEQUENCE-TO-SEQUENCE TRANSFORMATIONS
    UPRETI, R
    INDIAN JOURNAL OF PURE & APPLIED MATHEMATICS, 1982, 13 (04): : 454 - 457
  • [39] Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition
    Novitasari, Sashi
    Tjandra, Andros
    Sakti, Sakriani
    Nakamura, Satoshi
    INTERSPEECH 2019, 2019, : 3835 - 3839
  • [40] Detection and analysis of attention errors in sequence-to-sequence text-to-speech
    Valentini-Botinhao, Cassia
    King, Simon
    INTERSPEECH 2021, 2021, : 2746 - 2750