Statistical Regression Models for Noise Robust F0 Estimation Using Recurrent Deep Neural Networks

Cited by: 5
Authors
Kato, Akihiro [1]
Kinnunen, Tomi H. [2]
Affiliations
[1] Ricoh Co Ltd, Ricoh Inst Technol, Ebina, Kanagawa 2430460, Japan
[2] Univ Eastern Finland, Sch Comp, FI-80101 Joensuu, Finland
Funding
Academy of Finland;
Keywords
Estimation; Hidden Markov models; Speech processing; Noise robustness; Task analysis; Recurrent neural networks; Fundamental frequency; F0; pitch; waveform-to-sinusoid regression; regression model; recurrent neural networks; FUNDAMENTAL-FREQUENCY; MULTIPITCH TRACKING; SPEECH; PERFORMANCE; PREDICTION; ALGORITHM; LSTM;
DOI
10.1109/TASLP.2019.2945489
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
The fundamental frequency (F0) in a speech signal, which corresponds to pitch, is one of the key features involved in a variety of speech processing tasks. Therefore, accurate F0 estimation has remained an important problem for decades. However, this problem is difficult, especially in low signal-to-noise ratio (SNR) conditions with unknown noise. In this work, we propose new approaches to noise-robust F0 estimation using recurrent neural networks (RNNs). Recent F0 estimation studies exploit deep neural networks (DNNs), including RNNs, to classify acoustic features into quantized frequency states. In contrast to these classification approaches, we put forward a regression method for F0 tracking, which is accomplished with RNNs. To this end, we propose two variants. Our first model predicts the (scalar) F0 value directly from a spectrum, while our second model predicts a target sinusoidal waveform (with the desired F0) from the raw speech waveform. Our experiments with the pitch tracking database from Graz University of Technology (PTDB-TUG), contaminated by additive noise (NOISEX-92), demonstrate the improvement of the proposed approaches in terms of the gross pitch error (GPE) and fine pitch error (FPE) rates by more than 35% at SNRs between -10 dB and 10 dB against a well-known, noise-robust F0 tracker, PEFAC. Furthermore, our methods outperform state-of-the-art neural network-based approaches by more than 15% in terms of both the FPE and GPE rates over the abovementioned SNR range.
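The two metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes one common convention: GPE is the percentage of reference-voiced frames whose relative F0 error exceeds a threshold, and FPE is the standard deviation of the relative error over the remaining frames. The function name, the 20% default threshold, and the use of 0 Hz to mark unvoiced frames are all assumptions made for illustration; the paper may define these differently.

```python
import math


def gross_and_fine_pitch_error(f0_ref, f0_est, gross_threshold=0.2):
    """Return (GPE in %, FPE in %) over frames that are voiced in the reference.

    Assumptions (not taken from the paper): a reference value of 0 Hz marks an
    unvoiced frame, and gross_threshold is the relative-error cutoff.
    """
    # Relative error on reference-voiced frames only.
    rel_err = [(est - ref) / ref for ref, est in zip(f0_ref, f0_est) if ref > 0]
    if not rel_err:
        return float("nan"), float("nan")

    # Frames within the threshold count as "fine"; the rest are gross errors.
    fine = [e for e in rel_err if abs(e) <= gross_threshold]
    gpe = 100.0 * (len(rel_err) - len(fine)) / len(rel_err)
    if not fine:
        return gpe, float("nan")

    # FPE here is the population standard deviation of the fine errors.
    mean = sum(fine) / len(fine)
    fpe = 100.0 * math.sqrt(sum((e - mean) ** 2 for e in fine) / len(fine))
    return gpe, fpe


# Hypothetical F0 tracks in Hz; the third frame is unvoiced in the reference.
reference = [100.0, 200.0, 0.0, 150.0, 100.0]
estimate = [100.0, 300.0, 120.0, 150.0, 110.0]
gpe, fpe = gross_and_fine_pitch_error(reference, estimate)
print(f"GPE = {gpe:.1f}%, FPE = {fpe:.2f}%")
```

Under this convention, a lower GPE means fewer large tracking errors (such as octave jumps), while a lower FPE means tighter estimates on the frames that are roughly correct.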
Pages: 2336 - 2349
Page count: 14
Related papers
50 items in total
  • [1] Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks
    Kato, Akihiro
    Kinnunen, Tomi
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 327 - 331
  • [2] Direct F0 Estimation with Neural-Network-based Regression
    Xu, Shuzhuang
    Shimodaira, Hiroshi
    INTERSPEECH 2019, 2019, : 1995 - 1999
  • [3] Emotional Voice Conversion Using Deep Neural Networks with MCC and F0 Features
    Luo, Zhaojie
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2016 IEEE/ACIS 15TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS), 2016, : 977 - 981
  • [4] Modeling F0 trajectories in hierarchically structured deep neural networks
    Yin, Xiang
    Lei, Ming
    Qian, Yao
    Soong, Frank K.
    He, Lei
    Ling, Zhen-Hua
    Dai, Li-Rong
    SPEECH COMMUNICATION, 2016, 76 : 82 - 92
  • [5] Noise robust speech recognition using F0 contour information
    Iwano, K
    Seki, T
    Furui, S
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2004, E87D (05): : 1102 - 1109
  • [6] Multiband statistical learning for F0 estimation in speech
    Sha, F
    Burgoyne, JA
    Saul, LK
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: DESIGN AND IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS INDUSTRY TECHNOLOGY TRACKS MACHINE LEARNING FOR SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING SIGNAL PROCESSING FOR EDUCATION, 2004, : 661 - 664
  • [7] ROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION
    Kurth, Frank
    Cornaggia-Urrigshardt, Alessia
    Urrigshardt, Sebastian
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [8] Robust F0 Modeling for Mandarin Speech Recognition in Noise
    Qiang, Sheng
    Qian, Yao
    Soong, Frank K.
    Xu, Congfu
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 1101 - +
  • [9] Robust F0 estimation using ELS-based robust complex speech analysis
    Funaki, Keiichi
    Kinjo, Tatsuhiko
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2008, E91A (03) : 868 - 871
  • [10] Whisper to Normal Speech Based on Deep Neural Networks with MCC and F0 Features
    Lian, Hailun
    Hu, Yuting
    Zhou, Jian
    Wang, Huabin
    Tao, Liang
    2018 IEEE 23RD INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2018,