Statistical Regression Models for Noise Robust F0 Estimation Using Recurrent Deep Neural Networks

Cited by: 5
Authors
Kato, Akihiro [1]
Kinnunen, Tomi H. [2]
Affiliations
[1] Ricoh Co Ltd, Ricoh Inst Technol, Ebina, Kanagawa 2430460, Japan
[2] Univ Eastern Finland, Sch Comp, FI-80101 Joensuu, Finland
Funding
Academy of Finland
Keywords
Estimation; Hidden Markov models; Speech processing; Noise robustness; Task analysis; Recurrent neural networks; Fundamental frequency; F0; pitch; waveform-to-sinusoid regression; regression model; recurrent neural networks; FUNDAMENTAL-FREQUENCY; MULTIPITCH TRACKING; SPEECH; PERFORMANCE; PREDICTION; ALGORITHM; LSTM;
DOI
10.1109/TASLP.2019.2945489
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The fundamental frequency (F0) of a speech signal, which corresponds to pitch, is a key feature in a variety of speech processing tasks. Accurate F0 estimation has therefore remained an important open problem for decades. The problem is difficult, especially in low signal-to-noise ratio (SNR) conditions with unknown noise. In this work, we propose new approaches to noise-robust F0 estimation using recurrent neural networks (RNNs). Recent F0 estimation studies exploit deep neural networks (DNNs), including RNNs, to classify acoustic features into quantized frequency states. In contrast to these classification approaches, we put forward a regression method for F0 tracking, which is accomplished with RNNs. To this end, we propose two variants. Our first model predicts the (scalar) F0 value directly from a spectrum, while our second model predicts a target sinusoidal waveform (with the desired F0) from the raw speech waveform. Our experiments with the pitch tracking database from Graz University of Technology (PTDB-TUG), contaminated by additive noise (NOISEX-92), show that the proposed approaches improve the gross pitch error (GPE) and fine pitch error (FPE) rates by more than 35% at SNRs between -10 dB and 10 dB against a well-known, noise-robust F0 tracker, PEFAC. Furthermore, our methods outperform state-of-the-art neural network-based approaches by more than 15% in terms of both the FPE and GPE rates over the same SNR range.
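The target of the second variant, a sinusoid carrying the desired F0, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sinusoid_target`, the per-sample F0 input, the unit amplitude, and the use of 0.0 to mark unvoiced samples are all assumptions made for illustration.

```python
import math


def sinusoid_target(f0_per_sample, sample_rate):
    """Build a unit-amplitude sinusoid whose frequency follows an F0 contour.

    f0_per_sample: F0 in Hz for every sample. A value of 0.0 marks an
    unvoiced sample (an assumption of this sketch, not stated in the abstract).
    sample_rate: sampling rate in Hz.
    """
    phase = 0.0
    out = []
    for f0 in f0_per_sample:
        if f0 <= 0.0:
            out.append(0.0)  # emit silence for unvoiced samples
            phase = 0.0      # restart the phase at the next voiced segment
            continue
        # Accumulate phase so the frequency can change smoothly over time.
        phase += 2.0 * math.pi * f0 / sample_rate
        out.append(math.sin(phase))
    return out


# 10 ms of a constant 100 Hz contour at 16 kHz covers one full period.
target = sinusoid_target([100.0] * 160, 16000)
```

Accumulating the phase, rather than evaluating sin(2πf0·t) directly, keeps the waveform continuous when F0 varies from sample to sample. A regression network trained on such a target would map noisy speech samples to these values; how the paper constructs and post-processes its target is not specified in the abstract.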
Pages: 2336-2349
Page count: 14