F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech

Citations: 2

Authors:

Zhang, Yixuan ^{[1]}

Wang, Heming ^{[1]}

Wang, Deliang ^{[2,3]}

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Estimation; Noise measurement; Multitasking; Speech enhancement; Convolution; Training; Speech processing; Complex domain processing; densely-connected convolutional recurrent neural network; multi-task learning; neural cascade architecture; pitch tracking; voicing detection; MULTIPITCH TRACKING; PITCH; ALGORITHM; MASKING; ROBUST;

DOI:

10.1109/TASLP.2023.3313427

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.


Pages: 3760-3770

Page count: 11

39 items in total

[31] CONCURRENT ESTIMATION OF SINGING VOICE F0 AND PHONEMES BY USING SPECTRAL ENVELOPES ESTIMATED FROM POLYPHONIC MUSIC
Fujihara, Hiromasa
Goto, Masataka
2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 365 - 368
[32] Robust F0 Estimation Based on Log-Time Scale Autocorrelation and Its Application to Mandarin Tone Recognition
Kida, Yusuke
Sakai, Masaru
Masuko, Takashi
Kawamura, Akinori
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2931 - 2934
[33] WAVELET-BASED DECOMPOSITION OF F0 AS A SECONDARY TASK FOR DNN-BASED SPEECH SYNTHESIS WITH MULTI-TASK LEARNING
Ribeiro, Manuel Sam
Watts, Oliver
Yamagishi, Junichi
Clark, Robert A. J.
2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5525 - 5529
[34] A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis
Chunwijitra, Vataya
Nose, Takashi
Kobayashi, Takao
SPEECH COMMUNICATION, 2012, 54 (02) : 245 - 255
[35] Use of semantic context and F0 contours by older listeners during Mandarin speech recognition in quiet and single-talker interference conditions
Jiang, Wei
Li, Yu
Shu, Hua
Zhang, Linjun
Zhang, Yang
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 141 (04) : EL338 - EL344
[36] Effects of Hearing Loss on School-Aged Children's Ability to Benefit From F0 Differences Between Target and Masker Speech
Flaherty, Mary M.
Browning, Jenna
Buss, Emily
Leibold, Lori J.
EAR AND HEARING, 2021, 42 (04) : 1084 - 1096
[37] Music intervals in speech: Psychological disposition modulates ratio precision among interlocutors' nonlocal f0 production in real-time dyadic conversation
Robledo, Juan P.
Hurtado, Esteban
Prado, Felipe
Roman, Domingo
Cornejo, Carlos
PSYCHOLOGY OF MUSIC, 2016, 44 (06) : 1404 - 1418
[38] NOISE-ROBUST F0 ESTIMATION USING SNR-WEIGHTED SUMMARY CORRELOGRAMS FROM MULTI-BAND COMB FILTERS
Tan, Lee Ngee
Alwan, Abeer
2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 4464 - 4467
[39] Tandem-straight: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation
Kawahara, H.
Morise, M.
Takahashi, T.
Nisimura, R.
Irino, T.
Banno, H.
2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 3933 - +
