Sequence-to-Sequence Multi-Modal Speech In-Painting

Cited by: 0
Authors
Elyaderani, Mahsa Kadkhodaei [1]
Shirani, Shahram [1]
Affiliations
[1] McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada
Source
Keywords
speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks; AUDIO; INTERPOLATION;
DOI
10.21437/Interspeech.2023-1848
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Speech in-painting is the task of regenerating missing audio content using reliable context information. Despite various recent studies on multi-modal perception for audio in-painting, there is still a need for an effective infusion of visual and auditory information into speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip-reader for facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.
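To make the data flow described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It only illustrates how a gap of a given duration might be masked in a spectrogram and how per-frame visual features could be paired with the distorted audio frames before decoding. The function names `mask_gap` and `decoder_inputs`, the 10 ms hop, the zero-fill masking, and the assumption that visual features are already aligned to the audio frame rate are all hypothetical choices made for illustration.

```python
def mask_gap(spectrogram, start_ms, gap_ms, hop_ms=10.0):
    """Zero out the frames covering [start_ms, start_ms + gap_ms).

    `spectrogram` is a list of frames, each a list of float bins.
    `hop_ms` (time between frames) is an assumed value, not from the paper.
    Returns a masked copy and the (first, last) frame indices of the gap.
    """
    first = int(start_ms // hop_ms)
    last = min(len(spectrogram), int((start_ms + gap_ms) // hop_ms))
    masked = [list(frame) for frame in spectrogram]  # copy, leave input intact
    for t in range(first, last):
        masked[t] = [0.0] * len(masked[t])
    return masked, (first, last)


def decoder_inputs(masked_spectrogram, visual_features):
    """Concatenate, frame by frame, visual features with distorted audio.

    Assumes both sequences were already resampled to the same length;
    a real system would need to align video and audio frame rates first.
    """
    if len(masked_spectrogram) != len(visual_features):
        raise ValueError("audio and visual sequences must have equal length")
    return [list(v) + list(a) for v, a in zip(visual_features, masked_spectrogram)]


if __name__ == "__main__":
    # 2 s of toy audio at a 10 ms hop: 200 frames with 4 bins each.
    spec = [[1.0] * 4 for _ in range(200)]
    # Toy stand-in for lip-reading encoder outputs: 2 features per frame.
    visual = [[0.5, 0.5] for _ in range(200)]
    masked, (first, last) = mask_gap(spec, start_ms=500, gap_ms=300)
    inputs = decoder_inputs(masked, visual)
    print(first, last, len(inputs), len(inputs[0]))
```

In this toy setup the decoder would see intact visual features across the gap while the audio bins there are zero, which is the intuition behind using lip movements to guide the restoration.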
Pages: 829 - 833
Page count: 5
Related papers
50 in total
  • [21] Multi-modal Graph and Sequence Fusion Learning for Recommendation
    Wang, Zejun
    Wu, Xinglong
    Yang, Hongwei
    He, Hui
    Tai, Yu
    Zhang, Weizhe
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I, 2024, 14425 : 357 - 369
  • [22] UnitNet: A Sequence-to-Sequence Acoustic Model for Concatenative Speech Synthesis
    Zhou, Xiao
    Ling, Zhen-Hua
    Dai, Li-Rong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2643 - 2655
  • [23] SPEECH-TRANSFORMER: A NO-RECURRENCE SEQUENCE-TO-SEQUENCE MODEL FOR SPEECH RECOGNITION
    Dong, Linhao
    Xu, Shuang
    Xu, Bo
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5884 - 5888
  • [24] STATE-OF-THE-ART SPEECH RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS
    Chiu, Chung-Cheng
    Sainath, Tara N.
    Wu, Yonghui
    Prabhavalkar, Rohit
    Nguyen, Patrick
    Chen, Zhifeng
    Kannan, Anjuli
    Weiss, Ron J.
    Rao, Kanishka
    Gonina, Ekaterina
    Jaitly, Navdeep
    Li, Bo
    Chorowski, Jan
    Bacchiani, Michiel
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4774 - 4778
  • [25] COUPLED TRAINING OF SEQUENCE-TO-SEQUENCE MODELS FOR ACCENTED SPEECH RECOGNITION
    Unni, Vinit
    Joshi, Nitish
    Jyothi, Preethi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8254 - 8258
  • [26] MANDARIN ELECTROLARYNGEAL SPEECH VOICE CONVERSION WITH SEQUENCE-TO-SEQUENCE MODELING
    Yen, Ming-Chi
    Huang, Wen-Chin
    Kobayashi, Kazuhiro
    Peng, Yu-Huai
    Tsai, Shu-Wei
    Tsao, Yu
    Toda, Tomoki
    Jang, Jyh-Shing Roger
    Wang, Hsin-Min
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 650 - 657
  • [27] Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language
    Li, Huiyan
    Lin, Haohong
    Wang, You
    Wang, Hengyang
    Zhang, Ming
    Gao, Han
    Ai, Qing
    Luo, Zhiyuan
    Li, Guang
    BRAIN SCIENCES, 2022, 12 (07)
  • [28] FORWARD ATTENTION IN SEQUENCE-TO-SEQUENCE ACOUSTIC MODELING FOR SPEECH SYNTHESIS
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Dai, Li-Rong
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4789 - 4793
  • [29] Improving Sequence-to-sequence Tibetan Speech Synthesis with Prosodic Information
    Zhang, Weizhao
    Yang, Hongwu
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (09)
  • [30] CORRECTION OF AUTOMATIC SPEECH RECOGNITION WITH TRANSFORMER SEQUENCE-TO-SEQUENCE MODEL
    Hrinchuk, Oleksii
    Popova, Mariya
    Ginsburg, Boris
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7074 - 7078