Research on Voice Activity Detection Methods Based on Deep Learning

Cited by: 0

Authors:

Bai, Ke [1]

Yan, Huaicheng [1]

Li, Hao [1]

Tang, Nanxi [1]

Sun, Jiazheng [1]

Li, Zhichen [1]

Affiliations:

[1] East China Univ Sci & Technol, Key Lab Smart Mfg Energy Chem Proc, Minist Educ, Shanghai 200237, Peoples R China

Source:

2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024 | 2024

Keywords:

Voice Activity Detection; Convolutional Neural Network; Long Short-Term Memory network; Attention Mechanism; ALGORITHM;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Voice Activity Detection (VAD), a crucial component of speech processing, distinguishes speech from non-speech segments in an audio signal. By accurately identifying when speech occurs, it improves the efficiency and performance of speech processing and avoids wasting resources on non-speech segments. This paper introduces a deep learning-based VAD model, trained end to end, that takes Log-Mel features as input and combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory network (BiLSTM), incorporating an attention mechanism to refine the selection and extraction of speech features. We compare against three baseline models proposed on the AVA-Speech dataset and use ablation studies to validate the performance gains from the chosen sequence-processing network and from the attention module. Results on the AVA-Speech dataset show that our method achieves an accuracy (ACC) of 90% and an area under the ROC curve (AUC) of 0.9439, outperforming the other models and fulfilling the target task.
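
The abstract reports results as ACC and AUC. As a rough illustration of what those two numbers measure for frame-level VAD, the sketch below computes both from per-frame labels (1 = speech, 0 = non-speech) and per-frame model scores. The function names, the 0.5 decision threshold, and the toy data are assumptions made for this example; the record does not state how the authors computed their metrics.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of frames whose thresholded score matches the label.

    The 0.5 threshold is an assumption for illustration only.
    """
    if len(labels) != len(scores) or len(labels) == 0:
        raise ValueError("labels and scores must be non-empty and equally long")
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    AUC equals the probability that a randomly chosen speech frame scores
    higher than a randomly chosen non-speech frame; ties count as one half.
    This O(n_pos * n_neg) loop favors clarity over speed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one speech and one non-speech frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six frames with made-up scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2]
toy_acc = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

ACC depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is one reason both are commonly reported together.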


Pages: 1323-1328

Page count: 6
