WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Cited by: 2

Authors:

Rekimoto, Jun [1,2]

Affiliations:

[1] Univ Tokyo, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan

[2] Sony Comp Sci Labs Kyoto, 13-1 Hontoro Cho,Shimogyo Ku, Kyoto, Kyoto, Japan

Source:

PROCEEDINGS OF THE 2023 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2023 | 2023

Keywords:

speech interaction; whispered voice; whispered voice conversion; silent speech; artificial intelligence; neural networks; self-supervised learning

DOI:

10.1145/3544548.3580706

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used as a semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves the speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike the existing methods, this conversion is user-independent and does not require a paired dataset for whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, and it requires only an unlabeled target speaker's speech data. We confirmed that the quality of the speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach to perform speech reconstruction for people with speech or hearing disabilities.

Pages: 12

50 entries in total

[41] ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech
Ghorbani, Saeed
Ferstl, Ylva
Holden, Daniel
Troje, Nikolaus F.
Carbonneau, Marc-Andre
COMPUTER GRAPHICS FORUM, 2023, 42 (01) : 206 - 216
[42] Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
Lei, Yi
Yang, Shan
Cong, Jian
Xie, Lei
Su, Dan
INTERSPEECH 2022, 2022, : 2563 - 2567
[43] DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
Yuan, Ruibin
Wu, Yuxuan
Li, Jacob
Kim, Jaxter
INTERSPEECH 2022, 2022, : 2593 - 2597
[44] ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on VITS
Li, Jiaxin
Zhang, Lianhai
ELECTRONICS, 2023, 12 (04)
[45] Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Tang, Chuanxin
Luo, Chong
Zhao, Zhiyuan
Yin, Dacheng
Zhao, Yucheng
Zeng, Wenjun
INTERSPEECH 2021, 2021, : 3600 - 3604
[46] StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
Wang, Zhichao
Chen, Yuanzhe
Wang, Xinsheng
Xie, Lei
Wang, Yuping
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 7328 - 7338
[47] LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance
Chen, Shihao
Gu, Yu
Zhang, Jie
Li, Na
Chen, Rilin
Chen, Liping
Dai, Lirong
INTERSPEECH 2024, 2024, : 2770 - 2774
[48] SIG-VC: A SPEAKER INFORMATION GUIDED ZERO-SHOT VOICE CONVERSION SYSTEM FOR BOTH HUMAN BEINGS AND MACHINES
Zhang, Haozhe
Cai, Zexin
Qin, Xiaoyi
Li, Ming
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6567 - 6571
[49] Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-shot Voice Conversion
Xu, Le
Zhong, Rongxiu
Liu, Ying
Yang, Huibao
Zhang, Shilei
INTERSPEECH 2023, 2023, : 2293 - 2297
[50] Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers
Long, Ziang
Zheng, Yunling
Yu, Meng
Xin, Jack
2022 5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE FOR INDUSTRIES, AI4I, 2022, : 59 - 63