Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Times Cited: 2
Authors
Ai, Yang [1 ]
Ling, Zhen-Hua [1 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei 230027, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech phase prediction; parallel estimation architecture; anti-wrapping loss; low-latency; speech generation; RECONSTRUCTION; RETRIEVAL; VOCODER;
DOI
10.1109/TASLP.2024.3385285
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses, defined between the predicted wrapped phase spectra and the natural ones, by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions with a knowledge distillation training strategy. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural-network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows an outstanding efficiency advantage while maintaining the quality of the synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
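The abstract describes two concrete mechanisms: a phase calculation formula that maps the two parallel convolution outputs (treated as pseudo real and imaginary parts) to a wrapped phase in the principal value interval, and an anti-wrapping function that activates phase errors before averaging. The following is a minimal NumPy sketch, not the authors' implementation: the function names are ours, and the anti-wrapping function shown, |x − 2π·round(x/2π)|, is one function satisfying the stated parity, periodicity and monotonicity properties.

```python
import numpy as np

def anti_wrapping(x):
    # f(x) = |x - 2*pi*round(x / (2*pi))| is even (parity), 2*pi-periodic
    # (periodicity), and monotonically increasing on [0, pi] (monotonicity),
    # so a phase error of 2*pi is correctly treated as zero error.
    return np.abs(x - 2 * np.pi * np.round(x / (2 * np.pi)))

def parallel_phase_estimate(pseudo_real, pseudo_imag):
    # Phase calculation formula of the parallel estimation architecture:
    # the outputs of the two parallel linear convolutional layers play the
    # roles of real and imaginary parts, and atan2 strictly restricts the
    # predicted phase to the principal value interval (-pi, pi].
    return np.arctan2(pseudo_imag, pseudo_real)

def instantaneous_phase_loss(pred_phase, true_phase):
    # Anti-wrapping instantaneous phase loss: activate the raw phase error
    # with the anti-wrapping function before averaging, avoiding the error
    # expansion issue at the +/- pi wrapping boundary. The group delay and
    # instantaneous angular frequency losses apply the same activation to
    # phase differences along the frequency and time axes, respectively.
    return np.mean(anti_wrapping(pred_phase - true_phase))
```

For example, a predicted phase of pi − 0.1 against a natural phase of −pi + 0.1 has a raw error near 2*pi, but the activated loss is only 0.2, reflecting the true angular distance on the unit circle.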
Pages: 2283-2296
Page count: 14
References (44 entries)
[1]
Ai Y., 2023, IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), p. 1
[2]   APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra [J].
Ai, Yang; Ling, Zhen-Hua.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31:2145-2157
[3]   Denoising-and-Dereverberation Hierarchical Neural Vocoder for Statistical Parametric Speech Synthesis [J].
Ai, Yang; Ling, Zhen-Hua; Wu, Wei-Lu; Li, Ang.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30:2036-2048
[4]   A Neural Vocoder With Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis [J].
Ai, Yang; Ling, Zhen-Hua.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28:839-851
[5]  
[Anonymous], 1939, Bell Labs Record
[6]   Finding best approximation pairs relative to two closed convex sets in Hilbert spaces [J].
Bauschke, HH; Combettes, PL; Luke, DR.
Journal of Approximation Theory, 2004, 127(02):178-192
[7]   Hybrid projection-reflection method for phase retrieval [J].
Bauschke, HH; Combettes, PL; Luke, DR.
Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2003, 20(06):1025-1034
[8]   Phase retrieval, error reduction algorithm, and Fienup variants: a view from convex optimization [J].
Bauschke, HH; Combettes, PL; Luke, DR.
Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2002, 19(07):1334-1345
[9]
Buchholz S, 2011, Proc. 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), vols. 1-5, p. 3060
[10]
Dauphin YN, 2017, Proceedings of Machine Learning Research, vol. 70