Spatio-Temporal Attention Pooling for Audio Scene Classification

Citations: 9
Authors
Phan, Huy [1]
Chen, Oliver Y. [2]
Pham, Lam [1]
Koch, Philipp [3]
De Vos, Maarten [2]
McLoughlin, Ian [1]
Mertins, Alfred [3]
Affiliations
[1] Univ Kent, Sch Comp, Canterbury, Kent, England
[2] Univ Oxford, Dept Engn Sci, Oxford, England
[3] Univ Lubeck, Inst Signal Proc, Lubeck, Germany
Source
INTERSPEECH 2019 | 2019
Keywords
audio scene classification; attention pooling; convolutional neural network; recurrent neural network; neural networks
DOI
10.21437/Interspeech.2019-3040
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weigh and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.
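The pooling step described in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the linear scoring weights `w_time` and `w_space`, the softmax normalization, and the final sum over time are assumptions made here for illustration, and the function name `attention_pool` is hypothetical. Only the outer-product mask applied to the recurrent output comes from the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(output, w_time, w_space):
    """Pool a (T x F) recurrent output with a 2-D attention mask.

    output  : list of T rows, each with F features (recurrent output)
    w_time  : F assumed scoring weights for the temporal attention layer
    w_space : T assumed scoring weights for the spatial attention layer
    """
    T, F = len(output), len(output[0])

    # Temporal attention: one score per time step (assumed linear scoring).
    time_scores = [sum(output[t][f] * w_time[f] for f in range(F)) for t in range(T)]
    # Spatial attention: one score per feature dimension (assumed linear scoring).
    space_scores = [sum(output[t][f] * w_space[t] for t in range(T)) for f in range(F)]

    a_time = softmax(time_scores)    # length T, sums to 1
    a_space = softmax(space_scores)  # length F, sums to 1

    # Two-dimensional mask: outer product of the two attention vectors.
    mask = [[a_time[t] * a_space[f] for f in range(F)] for t in range(T)]

    # Weigh the output with the mask, then sum over time (assumed pooling).
    pooled = [sum(mask[t][f] * output[t][f] for t in range(T)) for f in range(F)]
    return pooled, mask


random.seed(0)
T, F = 6, 4
output = [[random.gauss(0, 1) for _ in range(F)] for _ in range(T)]
w_time = [random.gauss(0, 1) for _ in range(F)]
w_space = [random.gauss(0, 1) for _ in range(T)]
pooled, mask = attention_pool(output, w_time, w_space)
```

Because each attention vector sums to one, the entries of the mask also sum to one, so the pooled vector is a convex-style weighting of the recurrent output in which both irrelevant time steps and irrelevant feature dimensions receive small weights.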
Pages: 3845-3849
Page count: 5
Related papers
40 in total
[1] Abadi M, 2016, PROCEEDINGS OF OSDI'16: 12TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, P265
[2] [Anonymous], 2018, P DCASE WORKSH
[3] [Anonymous], 2018, ARXIV180507319
[4] Bisot V, 2016, INT CONF ACOUST SPEE, P6445, DOI 10.1109/ICASSP.2016.7472918
[5] Bisot V, 2015, EUR SIGNAL PR CONF, P719, DOI 10.1109/EUSIPCO.2015.7362477
[6] Boser B. E., 1992, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, P144, DOI 10.1145/130385.130401
[7] Cakir E, Parascandolo G, Heittola T, Huttunen H, Virtanen T, 2017, Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection, IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 25 (06), P1291-1303
[8] Cho K, 2014, ARXIV14061078
[9] Graves A, 2013, INT CONF ACOUST SPEE, P6645, DOI 10.1109/ICASSP.2013.6638947
[10] Guo J., 2017, P INTERSPEECH