Robust DNN-based VAD augmented with phone entropy based rejection of background speech

Cited by: 3
Authors
Fujita, Yuya [1]
Iso, Ken-ichi [1]
Affiliations
[1] Yahoo Japan Corp, Tokyo, Japan
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
Voice Activity Detection; Deep Neural Network; Entropy
DOI
10.21437/Interspeech.2016-136
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We propose a DNN-based voice activity detector (VAD) augmented by entropy-based frame rejection. A DNN-based VAD classifies each frame as speech or non-speech and achieves significantly higher performance than conventional statistical model-based VAD. We observed that many of the remaining errors are false alarms caused by background human speech, such as TV or radio audio or the conversations of nearby people. To reject such background speech frames, we introduce an entropy-based confidence measure computed from the phone posterior probabilities output by a DNN-based acoustic model. Compared to the target speaker's voice, background speech tends to have relatively unclear pronunciation or to be contaminated by other types of noise, so its entropy is larger than that of audio containing only the target speaker's voice. Combining the DNN-based VAD with the entropy criterion, we reject frames that the DNN-based VAD classifies as speech but whose entropy exceeds a threshold. We evaluated the proposed approach and confirmed a reduction of more than 10% in Sentence Error Rate.
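The rejection rule described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact entropy definition, the threshold value, or any temporal smoothing, so everything below is an assumption for illustration: plain Shannon entropy over per-frame phone posteriors, a single fixed threshold, and the hypothetical names `phone_entropy` and `reject_background_speech`. It is not the authors' implementation.

```python
import math
from typing import List, Sequence


def phone_entropy(posteriors: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one frame's phone posterior distribution.

    Assumption: the posteriors are non-negative and sum to 1.
    Zero-probability phones contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def reject_background_speech(
    vad_speech: Sequence[bool],
    phone_posteriors: Sequence[Sequence[float]],
    entropy_threshold: float,
) -> List[bool]:
    """Keep a frame as speech only if the VAD labels it speech AND
    its phone-posterior entropy does not exceed the threshold."""
    if len(vad_speech) != len(phone_posteriors):
        raise ValueError("need exactly one posterior vector per frame")
    return [
        is_speech and phone_entropy(post) <= entropy_threshold
        for is_speech, post in zip(vad_speech, phone_posteriors)
    ]


# Toy input with made-up numbers (not data from the paper).
vad = [True, True, False]
posts = [
    [0.97, 0.01, 0.01, 0.01],  # peaked posterior: low entropy, kept as speech
    [0.25, 0.25, 0.25, 0.25],  # flat posterior: high entropy, rejected
    [0.97, 0.01, 0.01, 0.01],  # VAD already says non-speech, stays non-speech
]
decisions = reject_background_speech(vad, posts, entropy_threshold=0.5)
print(decisions)  # [True, False, False]
```

In this toy case the flat posterior has entropy ln 4 (about 1.39 nats), above the illustrative threshold of 0.5, so the frame is rejected even though the VAD labeled it speech. In practice the threshold would need to be tuned on held-out data.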
Pages: 3663-3667
Number of pages: 5