Two-stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-token Connectionist Temporal Classification

Cited by: 3
Authors
Park, Inyoung [1]
Kim, Hong Kook [1,2]
Affiliations
[1] Gwangju Inst Sci & Technol, AI Grad Sch, Gwangju 61005, South Korea
[2] Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea
Source
INTERSPEECH 2020 | 2020
Keywords
polyphonic sound event detection (SED); faster regional convolutional neural network (R-CNN); multi-token connectionist temporal classification (Multi-token CTC); attention long short-term memory (attention-LSTM)
DOI
10.21437/Interspeech.2020-3097
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification code
100104; 100213
Abstract
We propose a two-stage sound event detection (SED) model to deal with sound events that overlap in time and frequency. In the first stage, which consists of a faster R-CNN and an attention-LSTM, each log-mel spectrogram segment is divided into one or more proposed regions (PRs) according to the coordinates of a region proposal network. To efficiently train on polyphonic sound, we take only one PR for each sound event from a bounding box regressor associated with the attention-LSTM. In the second stage, the original input image and the difference image between adjacent segments are separately pooled according to the coordinates of each PR predicted in the first stage. Then, the two feature maps produced by CNNs are concatenated and processed further by an LSTM. Finally, CTC-based n-best SED is conducted using the softmax output from the CNN-LSTM, where CTC has two tokens for each event so that the start and end time frames are accurately detected. Experiments on SED using DCASE 2019 Task 3 show that the proposed two-stage model with multi-token CTC achieves an F1-score of 97.5%, while the first stage alone and the two-stage model with a conventional CTC yield F1-scores of 91.9% and 95.6%, respectively.
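The idea of giving CTC two tokens per event, one for the start and one for the end, can be illustrated with a small sketch. It is not the authors' implementation: the token numbering (blank = 0, odd indices for starts, even indices for ends), the function names, and the example events are all assumptions made for illustration, and the decoder below uses a simple greedy collapse rather than the n-best decoding mentioned in the abstract.

```python
from typing import Dict, List, Tuple

BLANK = 0  # CTC blank index (assumed convention, not from the paper)


def start_token(class_id: int) -> int:
    """Hypothetical token emitted at the onset of event class `class_id`."""
    return 1 + 2 * class_id


def end_token(class_id: int) -> int:
    """Hypothetical token emitted at the offset of event class `class_id`."""
    return 2 + 2 * class_id


def events_to_targets(events: List[Tuple[int, float, float]]) -> List[int]:
    """Turn (class_id, onset_sec, offset_sec) events into a CTC target sequence.

    Each event contributes a start token and an end token; the tokens are
    ordered by time, so overlapping events interleave naturally.
    """
    boundaries = []
    for class_id, onset, offset in events:
        boundaries.append((onset, start_token(class_id)))
        boundaries.append((offset, end_token(class_id)))
    boundaries.sort()
    return [token for _, token in boundaries]


def decode_events(frame_tokens: List[int], hop_sec: float) -> List[Tuple[int, float, float]]:
    """Greedy decoding: collapse repeated tokens, skip blanks, pair start/end.

    `frame_tokens` holds the per-frame argmax token; an event is reported
    once the end token of a class follows an open start token of that class.
    """
    events: List[Tuple[int, float, float]] = []
    open_onsets: Dict[int, float] = {}  # class_id -> onset time in seconds
    prev = BLANK
    for frame, token in enumerate(frame_tokens):
        if token != BLANK and token != prev:
            class_id = (token - 1) // 2
            if token % 2 == 1:  # odd index: start token
                open_onsets[class_id] = frame * hop_sec
            elif class_id in open_onsets:  # even index: end token
                events.append((class_id, open_onsets.pop(class_id), frame * hop_sec))
        prev = token
    return events


if __name__ == "__main__":
    # Two overlapping events: class 1 lies entirely inside class 0.
    print(events_to_targets([(0, 0.5, 2.0), (1, 1.0, 1.5)]))
    print(decode_events([0, 1, 1, 0, 3, 0, 4, 2, 0], hop_sec=0.5))
```

With a conventional single-token CTC, each event would map to one label and its boundaries would have to be inferred from where that label fires; separate start and end tokens make both boundaries explicit targets.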
Pages: 856-860
Page count: 5