Sequence-to-Sequence Multi-Modal Speech In-Painting

Cited by: 0

Authors:

Elyaderani, Mahsa Kadkhodaei [1]

Shirani, Shahram [1]

Affiliations:

[1] McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada

Source:

INTERSPEECH 2023 | 2023

Keywords:

speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks; AUDIO; INTERPOLATION;

DOI:

10.21437/Interspeech.2023-1848

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Speech in-painting is the task of regenerating missing audio content from reliable context information. Despite various recent studies on multi-modal perception for audio in-painting, there is still a need for an effective infusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip-reader for facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.
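
To make the distortion lengths quoted in the abstract concrete, the following minimal Python sketch converts a gap duration into a number of spectrogram frames and blanks those frames. It is an illustration only, not the authors' pipeline: the 10 ms hop size, the zero-filling of missing frames, and the names gap_to_frames and mask_spectrogram are all assumptions introduced here.

```python
def gap_to_frames(gap_ms, hop_ms=10.0):
    """Number of spectrogram frames spanned by a gap of gap_ms milliseconds.

    hop_ms is an assumed frame hop; the abstract does not state one.
    """
    return int(round(gap_ms / hop_ms))


def mask_spectrogram(spec, start_frame, n_frames):
    """Zero out n_frames frames of spec starting at start_frame.

    spec is a list of frames, each frame a list of per-frequency magnitudes.
    Returns a masked copy and a per-frame reliability mask (1 = kept, 0 = missing).
    Zero-filling is an assumption made for this sketch.
    """
    total = len(spec)
    end = min(start_frame + n_frames, total)
    masked = [list(frame) for frame in spec]  # copy so the input stays intact
    reliable = [1] * total
    for t in range(start_frame, end):
        masked[t] = [0.0] * len(spec[t])
        reliable[t] = 0
    return masked, reliable


# Toy spectrogram: 200 frames with 4 frequency bins each (2 s at a 10 ms hop).
spec = [[1.0] * 4 for _ in range(200)]
n = gap_to_frames(300)  # shortest gap mentioned in the abstract
masked, reliable = mask_spectrogram(spec, start_frame=50, n_frames=n)
```

Under these assumptions, the 300 ms and 1500 ms gaps cover 30 and 150 frames respectively. In a model of the kind the abstract describes, a masked spectrogram like this would be the audio input to the decoder, alongside the encoder's visual features.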

Pages: 829-833

Page count: 5

