WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Cited by: 2
Authors
Rekimoto, Jun [1 ,2 ]
Affiliations
[1] Univ Tokyo, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
[2] Sony Comp Sci Labs Kyoto, 13-1 Hontoro Cho,Shimogyo Ku, Kyoto, Kyoto, Japan
Keywords
speech interaction; whispered voice; whispered voice conversion; silent speech; artificial intelligence; neural networks; self-supervised learning;
DOI
10.1145/3544548.3580706
Chinese Library Classification: TP [Automation and Computer Technology]
Discipline Code: 0812
Abstract
Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used for semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, and it requires only unlabeled speech data from the target speaker. We confirmed that the quality of speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach in performing speech reconstruction for people with speech or hearing disabilities.
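The two-stage pipeline described in the abstract (speech frames → discrete units → reconstructed speech) can be sketched as a toy example. The codebook, function names, and nearest-neighbour quantization below are illustrative assumptions for exposition only; the paper's actual STU encoder and UTS decoder are learned self-supervised neural models:

```python
import numpy as np

def stu_encode(features, codebook):
    """Toy speech-to-unit (STU) encoder: map each frame's feature
    vector to the index of its nearest codebook unit. Stands in for
    the learned encoder that yields units shared by whispered and
    normal speech."""
    # Pairwise distances, shape (frames, units), via broadcasting.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def uts_decode(units, codebook):
    """Toy unit-to-speech (UTS) decoder: look units back up in the
    codebook. A real decoder would synthesize a waveform in the
    target speaker's voice from the unit sequence."""
    return codebook[units]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 hypothetical discrete speech units
# Simulated whisper frames: noisy copies of units 2, 2, 5, 1.
whisper_frames = codebook[[2, 2, 5, 1]] + 0.01 * rng.normal(size=(4, 4))

units = stu_encode(whisper_frames, codebook)   # frame-level unit sequence
recon = uts_decode(units, codebook)            # "speech" rebuilt from units
print(units.tolist())  # → [2, 2, 5, 1]
```

Because the units abstract away speaker identity, the same decoder can in principle be retargeted to any voice by training it only on that speaker's unlabeled speech, which is what makes the conversion zero-shot and pair-free.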
Pages: 12