SPEECH ACTIVITY DETECTION IN ONLINE BROADCAST TRANSCRIPTION USING DEEP NEURAL NETWORKS AND WEIGHTED FINITE STATE TRANSDUCERS

Cited by: 0
Authors
Mateju, Lukas [1 ]
Cerva, Petr [1 ]
Zdansky, Jindrich [1 ]
Malek, Jiri [1 ]
Affiliations
[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic
Source
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2017年
Keywords
deep neural networks; speech activity detection; weighted finite state transducers; speech recognition; VOICE;
DOI
not available
CLC classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this paper, a new approach to online Speech Activity Detection (SAD) is proposed. The approach is designed for use in a system that performs 24/7 transcription of radio/TV broadcasts containing a large amount of non-speech segments, such as advertisements or music. To improve the robustness of detection, we adopt Deep Neural Networks (DNNs) trained on artificially created mixtures of speech and non-speech signals at desired levels of signal-to-noise ratio (SNR). An integral part of our approach is an online decoder based on Weighted Finite State Transducers (WFSTs); this decoder smooths the output of the DNN. The employed transduction model is context-based, i.e., both speech and non-speech events are modeled using sequences of states. The presented experimental results show that our approach yields state-of-the-art results on the standardized QUT-NOISE-TIMIT data set for SAD and, at the same time, is capable of a) operating with low latency and b) reducing the computational demands and error rate of the target transcription system.
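The abstract describes two ingredients: DNN training data built by mixing speech with non-speech signals at target SNR levels, and an online decoder that smooths the frame-level DNN output. The sketch below is a minimal, hypothetical illustration of both ideas in Python, not the authors' implementation: the paper uses a context-based WFST decoder, whereas the smoother here is a plain two-state Viterbi pass, and the helper names (mix_at_snr, smooth_posteriors) and the self-transition probability p_stay are assumptions introduced for the example.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that the resulting speech-to-noise power ratio
    # equals snr_db, then return the mixture (hypothetical helper).
    noise = np.resize(noise, speech.shape)            # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def smooth_posteriors(p_speech, p_stay=0.99):
    # Two-state (non-speech/speech) Viterbi smoothing of per-frame speech
    # posteriors; a generic HMM-style stand-in for the paper's WFST decoder.
    p_speech = np.asarray(p_speech, dtype=float)
    T = len(p_speech)
    emis = np.stack([1.0 - p_speech, p_speech])       # (2, T) emission probs
    log_emis = np.log(emis + 1e-12)
    log_trans = np.log(np.array([[p_stay, 1 - p_stay],
                                 [1 - p_stay, p_stay]]))
    delta = log_emis[:, 0].copy()                     # best score per state
    back = np.zeros((2, T), dtype=int)                # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans           # scores[prev, cur]
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores[back[:, t], [0, 1]] + log_emis[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                     # backtrace
        path[t - 1] = back[path[t], t]
    return path                                       # 1 = speech, 0 = non-speech

In this sketch, the high self-transition probability penalizes rapid switching between the speech and non-speech states, which is the same smoothing effect the abstract attributes to modeling both event types as sequences of states in the transduction model.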
Pages: 5460-5464
Page count: 5