Robust Methods for Text-Dependent Speaker Verification

Cited by: 1
Authors
Bhukya, Ramesh K. [1]
Prasanna, S. R. Mahadeva [1, 2]
Sarma, Biswajit Dev [3]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Elect & Elect Engn, Electro Med & Speech Technol Lab, Gauhati 781039, India
[2] Indian Inst Technol Dharwad, Dept Elect Engn, Dharwad 580011, Karnataka, India
[3] Bay Area Adv Analyt India P Ltd, Gauhati 781039, India
Keywords
End point detection; VLRs; Dominant resonant frequency; Glottal activity detection; Foreground speech segmentation; MEMD; IMFs; Hilbert spectrum; MFCCs; TDSV; DTW; EMPIRICAL MODE DECOMPOSITION; END-POINT DETECTION; SPEECH; RECOGNITION; VOWEL;
DOI
10.1007/s00034-019-01125-x
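Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification codes
0808 ; 0809 ;
Abstract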
In this work, we explore various noise-robust techniques at different stages of a Text-Dependent Speaker Verification (TDSV) system. A speech-specific, knowledge-based robust end point detection technique is used for noise compensation at the signal level. Feature-level compensation is done using robust features extracted from the Hilbert Spectrum (HS) of the Intrinsic Mode Functions (IMFs) obtained from Modified Empirical Mode Decomposition (MEMD) of speech. We also explore a combined temporal and spectral speech enhancement technique, applied prior to end point detection, to enhance speech regions embedded in noise. All experimental studies are conducted using two databases, namely RSR2015 and the IITG database. Robust end point detection improves the performance of the TDSV system compared to energy-based end point detection in both clean and degraded speech conditions. Using noise-robust HS features augmented with Mel-frequency cepstral coefficients (MFCCs) further improves the performance of the system. Applying speech enhancement prior to signal-level and feature-level compensation yields a further improvement in the low-SNR cases. The final combined system, which uses all three robust methods, provides a relative improvement of 6 to 25% in equal error rate (EER) on the RSR2015 database corrupted with babble noise of varying strength, and a relative improvement of around 30 to 45% on the IITG database.
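The abstract reports its gains as relative improvements in EER. As a minimal sketch of that arithmetic, assuming the usual definition (baseline EER minus improved EER, divided by baseline EER), the helper below computes the percentage. The function name and the EER values are hypothetical illustrations and are not taken from the paper.

```python
def relative_improvement(baseline_eer: float, improved_eer: float) -> float:
    """Return the relative EER reduction in percent (lower EER is better)."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - improved_eer) / baseline_eer


# Hypothetical EER values in percent, chosen only to illustrate the formula.
baseline = 10.0   # assumed EER of a baseline system
combined = 7.5    # assumed EER of an improved system
print(relative_improvement(baseline, combined))  # 25.0
```

Under this definition, a drop from 10% to 7.5% EER is a 25% relative improvement, even though the absolute change is only 2.5 percentage points.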
Pages: 5253-5288
Number of pages: 36