Research on Voice Activity Detection Methods Based on Deep Learning

Citations: 0
Authors
Bai, Ke [1]
Yan, Huaicheng [1]
Li, Hao [1]
Tang, Nanxi [1]
Sun, Jiazheng [1]
Li, Zhichen [1]
Affiliation
[1] East China Univ Sci & Technol, Key Lab Smart Mfg Energy Chem Proc, Minist Educ, Shanghai 200237, Peoples R China
Source
2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024 | 2024
Keywords
Voice Activity Detection; Convolutional Neural Network; Long Short-Term Memory network; Attention Mechanism; ALGORITHM;
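DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract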
Voice Activity Detection (VAD), a crucial component of speech processing, distinguishes speech from non-speech segments in an audio signal. By accurately identifying when speech occurs, it improves the efficiency and performance of speech processing and avoids wasting resources on non-speech segments. This paper introduces an end-to-end deep learning VAD model that takes Log-Mel features as input and combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory network (BiLSTM), adding an attention mechanism to refine the selection and extraction of speech features. We compare the model against three baseline models proposed on the AVA-Speech dataset, and ablation studies validate the performance gains from the chosen sequence-processing network and from the attention module. On the AVA-Speech dataset, our method achieves an accuracy (ACC) of 90% and an AUC of 0.9439, outperforming the other models.
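The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes that ACC is frame-level accuracy at a fixed decision threshold and that AUC is the area under the ROC curve. The abstract does not state the threshold or the evaluation protocol, so the function names, the 0.5 threshold, and the toy labels and scores below are hypothetical and are not taken from the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of frames whose thresholded score matches the 0/1 label.

    The 0.5 threshold is an assumption; the paper's threshold is not given.
    """
    correct = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == bool(y))
    return correct / len(labels)


def roc_auc(labels, scores):
    """ROC AUC computed as a rank statistic.

    It equals the probability that a randomly chosen speech frame receives a
    higher score than a randomly chosen non-speech frame; ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both speech and non-speech frames are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy frame labels (1 = speech, 0 = non-speech) and model scores, made up
# purely for illustration.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.6, 0.1, 0.7, 0.2]

print(accuracy(labels, scores))  # 0.75
print(roc_auc(labels, scores))   # 0.9375
```

Accuracy depends on the chosen threshold, while AUC summarises ranking quality across all thresholds, which is one reason VAD results are often reported with both.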
Pages: 1323-1328
Page count: 6