WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Times Cited: 2
Authors
Rekimoto, Jun [1,2]
Affiliations
[1] Univ Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
[2] Sony Comp Sci Labs Kyoto, 13-1 Hontoro-cho, Shimogyo-ku, Kyoto, Kyoto, Japan
Keywords
speech interaction; whispered voice; whispered voice conversion; silent speech; artificial intelligence; neural networks; self-supervised learning
DOI
10.1145/3544548.3580706
CLC Number
TP [Automation and Computer Technology]
Subject Classification Code
0812
Abstract
Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can serve as a semi-silent form of speech interaction in public places without being audible to others. Converting whispers to normal speech also improves speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques either do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from the speech units, requiring only unlabeled speech data from that speaker. We confirmed that the quality of speech converted from a whisper improved while preserving its natural prosody, and we confirmed the effectiveness of the proposed approach for speech reconstruction for people with speech or hearing disabilities.
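To make the two-stage pipeline in the abstract concrete, the PyTorch sketch below shows a hypothetical STU encoder that quantizes waveform features to discrete unit indices and a UTS decoder that re-synthesizes audio from those units. All module names, layer sizes, and the codebook size here are illustrative assumptions, not WESPER's actual implementation (the paper builds on a self-supervised encoder; the sketch only mirrors the interface the abstract describes).

```python
import torch
import torch.nn as nn

class SpeechToUnit(nn.Module):
    """Hypothetical STU encoder: maps a raw waveform to a sequence of
    discrete speech-unit indices intended to be shared by whispered and
    normal speech (a stand-in for the self-supervised encoder the
    abstract describes)."""
    def __init__(self, n_units=100, dim=256):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform to frame features.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
        )
        # Learned codebook; each frame is assigned its nearest unit.
        self.codebook = nn.Parameter(torch.randn(n_units, dim))

    def forward(self, wav):                      # wav: (batch, samples)
        feats = self.frontend(wav.unsqueeze(1))  # (batch, dim, frames)
        feats = feats.transpose(1, 2)            # (batch, frames, dim)
        dists = torch.cdist(feats, self.codebook.unsqueeze(0))
        return dists.argmin(dim=-1)              # (batch, frames) unit ids

class UnitToSpeech(nn.Module):
    """Hypothetical UTS decoder: re-synthesizes a waveform from unit ids.
    Per the abstract, training such a decoder needs only unlabeled speech
    from the target speaker, since units can be extracted automatically."""
    def __init__(self, n_units=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        # Transposed convolutions upsample unit embeddings back to audio rate.
        self.vocoder = nn.Sequential(
            nn.ConvTranspose1d(dim, dim, kernel_size=16, stride=8), nn.GELU(),
            nn.ConvTranspose1d(dim, 1, kernel_size=80, stride=40),
        )

    def forward(self, units):                    # units: (batch, frames)
        x = self.embed(units).transpose(1, 2)    # (batch, dim, frames)
        return self.vocoder(x).squeeze(1)        # (batch, samples)

# Zero-shot conversion at inference time: encode a whisper to speaker-
# independent units, then decode them in the target speaker's voice.
stu, uts = SpeechToUnit(), UnitToSpeech()
whisper = torch.randn(1, 16000)                  # 1 s of 16 kHz audio (dummy)
normal_speech = uts(stu(whisper))
```

The key design point, per the abstract, is that whispered and normal renditions of the same content should map to the same unit sequence; that shared intermediate representation is what removes the need for paired whisper/normal data and makes the conversion user-independent.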
Pages: 12
Related Papers (50 in total)
  • [41] ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech
    Ghorbani, Saeed
    Ferstl, Ylva
    Holden, Daniel
    Troje, Nikolaus F.
    Carbonneau, Marc-Andre
COMPUTER GRAPHICS FORUM, 2023, 42 (01): 206-216
  • [42] Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
    Lei, Yi
    Yang, Shan
    Cong, Jian
    Xie, Lei
    Su, Dan
INTERSPEECH 2022, 2022: 2563-2567
  • [43] DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
    Yuan, Ruibin
    Wu, Yuxuan
    Li, Jacob
    Kim, Jaxter
INTERSPEECH 2022, 2022: 2593-2597
  • [44] ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on VITS
    Li, Jiaxin
    Zhang, Lianhai
    ELECTRONICS, 2023, 12 (04)
  • [45] Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
    Tang, Chuanxin
    Luo, Chong
    Zhao, Zhiyuan
    Yin, Dacheng
    Zhao, Yucheng
    Zeng, Wenjun
INTERSPEECH 2021, 2021: 3600-3604
  • [46] StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
    Wang, Zhichao
    Chen, Yuanzhe
    Wang, Xinsheng
    Xie, Lei
    Wang, Yuping
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 7328-7338
  • [47] LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance
    Chen, Shihao
    Gu, Yu
    Zhang, Jie
    Li, Na
    Chen, Rilin
    Chen, Liping
    Dai, Lirong
INTERSPEECH 2024, 2024: 2770-2774
  • [48] SIG-VC: A SPEAKER INFORMATION GUIDED ZERO-SHOT VOICE CONVERSION SYSTEM FOR BOTH HUMAN BEINGS AND MACHINES
    Zhang, Haozhe
    Cai, Zexin
    Qin, Xiaoyi
    Li, Ming
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 6567-6571
  • [49] Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-shot Voice Conversion
    Xu, Le
    Zhong, Rongxiu
    Liu, Ying
    Yang, Huibao
    Zhang, Shilei
INTERSPEECH 2023, 2023: 2293-2297
  • [50] Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers
    Long, Ziang
    Zheng, Yunling
    Yu, Meng
    Xin, Jack
2022 5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE FOR INDUSTRIES, AI4I, 2022: 59-63