F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech

Cited by: 2
Authors
Zhang, Yixuan [1]
Wang, Heming [1]
Wang, Deliang [2, 3]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Keywords
Estimation; Noise measurement; Multitasking; Speech enhancement; Convolution; Training; Speech processing; Complex domain processing; densely-connected convolutional recurrent neural network; multi-task learning; neural cascade architecture; pitch tracking; voicing detection; MULTIPITCH TRACKING; PITCH; ALGORITHM; MASKING; ROBUST;
DOI
10.1109/TASLP.2023.3313427
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.
Pages: 3760 - 3770
Page count: 11
Related papers
39 in total
  • [31] CONCURRENT ESTIMATION OF SINGING VOICE F0 AND PHONEMES BY USING SPECTRAL ENVELOPES ESTIMATED FROM POLYPHONIC MUSIC
    Fujihara, Hiromasa
    Goto, Masataka
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 365 - 368
  • [32] Robust F0 Estimation Based on Log-Time Scale Autocorrelation and Its Application to Mandarin Tone Recognition
    Kida, Yusuke
    Sakai, Masaru
    Masuko, Takashi
    Kawamura, Akinori
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2931 - 2934
  • [33] WAVELET-BASED DECOMPOSITION OF F0 AS A SECONDARY TASK FOR DNN-BASED SPEECH SYNTHESIS WITH MULTI-TASK LEARNING
    Ribeiro, Manuel Sam
    Watts, Oliver
    Yamagishi, Junichi
    Clark, Robert A. J.
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5525 - 5529
  • [34] A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis
    Chunwijitra, Vataya
    Nose, Takashi
    Kobayashi, Takao
    SPEECH COMMUNICATION, 2012, 54 (02) : 245 - 255
  • [35] Use of semantic context and F0 contours by older listeners during Mandarin speech recognition in quiet and single-talker interference conditions
    Jiang, Wei
    Li, Yu
    Shu, Hua
    Zhang, Linjun
    Zhang, Yang
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 141 (04) : EL338 - EL344
  • [36] Effects of Hearing Loss on School-Aged Children's Ability to Benefit From F0 Differences Between Target and Masker Speech
    Flaherty, Mary M.
    Browning, Jenna
    Buss, Emily
    Leibold, Lori J.
    EAR AND HEARING, 2021, 42 (04) : 1084 - 1096
  • [37] Music intervals in speech: Psychological disposition modulates ratio precision among interlocutors' nonlocal f0 production in real-time dyadic conversation
    Robledo, Juan P.
    Hurtado, Esteban
    Prado, Felipe
    Roman, Domingo
    Cornejo, Carlos
    PSYCHOLOGY OF MUSIC, 2016, 44 (06) : 1404 - 1418
  • [38] NOISE-ROBUST F0 ESTIMATION USING SNR-WEIGHTED SUMMARY CORRELOGRAMS FROM MULTI-BAND COMB FILTERS
    Tan, Lee Ngee
    Alwan, Abeer
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 4464 - 4467
  • [39] Tandem-straight: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation
    Kawahara, H.
    Morise, M.
    Takahashi, T.
    Nisimura, R.
    Irino, T.
    Banno, H.
    2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 3933 - +