Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation

Times cited: 0

Authors:

Huang, Zhiying [1]

Zhang, Shiliang [1]

Lei, Ming [1]

Affiliation:

[1] Alibaba Inc, Hangzhou, Peoples R China

Source:

INTERSPEECH 2019 | 2019

Keywords:

Audio Set; audio tagging; compact feedforward sequential memory network; audio-to-audio ratio; data augmentation; classification

DOI:

10.21437/Interspeech.2019-1302

Chinese Library Classification (CLC) numbers:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

Audio tagging aims to identify the presence or absence of audio events in an audio clip. Recently, many researchers have explored different model structures to improve audio tagging performance. Among the wide variety of model structures, the convolutional neural network (CNN) is the most popular choice, and it has been successfully applied to audio event prediction. However, CNNs have relatively high model complexity, which makes them insufficiently efficient for deployment in real products. In this paper, the compact feedforward sequential memory network (cFSMN) is proposed for the audio tagging task. Experimental results show that the cFSMN-based system achieves performance comparable to that of the CNN-based system. In addition, an audio-to-audio ratio (AAR) based data augmentation method is proposed to further improve classifier performance. Finally, using raw waveforms from the balanced training set of Audio Set, a publicly released standard database, our system achieves state-of-the-art performance with an AUC of 0.932. Moreover, the cFSMN-based model has only 1.9 million parameters, about 1/30 of the parameter count of the CNN-based model.
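The abstract names two techniques, AAR-based data augmentation and the cFSMN memory block, without giving formulas for either. The sketch below illustrates both under explicit assumptions and is not the authors' implementation. Assumption 1: the audio-to-audio ratio is taken to be the energy ratio, in dB, between a target clip and a second clip added to it, by analogy with SNR-controlled noise mixing. Assumption 2: the mixed clip is labeled with the union of both clips' event tags. Assumption 3: the memory block follows the vectorized cFSMN form from earlier cFSMN work, p~_t = p_t + sum_{i=0..N1} a_i * p_{t-i} + sum_{j=1..N2} c_j * p_{t+j} (elementwise), with frames outside the clip treated as zero. The function names mix_at_aar, merge_labels and cfsmn_memory are hypothetical.

```python
import math


def energy(x):
    """Sum of squared sample values of a clip."""
    return sum(s * s for s in x)


def mix_at_aar(target, other, aar_db):
    """Add `other` to `target`, scaled so that
    10 * log10(E_target / E_scaled_other) == aar_db.

    Assumption: AAR is an energy ratio in dB, analogous to SNR mixing.
    Both clips must have the same length and `other` must not be silent.
    Returns the mixed clip and the gain applied to `other`.
    """
    if len(target) != len(other):
        raise ValueError("clips must have the same length")
    e_target, e_other = energy(target), energy(other)
    if e_other == 0.0:
        raise ValueError("`other` must not be silent")
    # Solve E_target / (gain**2 * E_other) = 10**(aar_db / 10) for gain.
    gain = math.sqrt(e_target / (e_other * 10 ** (aar_db / 10)))
    mixed = [t + gain * o for t, o in zip(target, other)]
    return mixed, gain


def merge_labels(labels_a, labels_b):
    """Assumption: the mixed clip carries every event tag of both sources."""
    return sorted(set(labels_a) | set(labels_b))


def cfsmn_memory(p, a, c):
    """Vectorized cFSMN-style memory block over projected frames.

    p: T frames, each a list of D floats (output of the low-rank projection).
    a: N1 + 1 look-back coefficient vectors for frames t, t-1, ..., t-N1.
    c: N2 look-ahead coefficient vectors for frames t+1, ..., t+N2.
    Computes p~_t = p_t + sum_i a_i * p_{t-i} + sum_j c_j * p_{t+j}
    elementwise; frames outside the clip contribute nothing.
    """
    T = len(p)
    out = []
    for t in range(T):
        frame = list(p[t])  # identity path: start from p_t
        for i, a_i in enumerate(a):  # look-back taps, i = 0..N1
            if t - i >= 0:
                for d, coef in enumerate(a_i):
                    frame[d] += coef * p[t - i][d]
        for j, c_j in enumerate(c, start=1):  # look-ahead taps, j = 1..N2
            if t + j < T:
                for d, coef in enumerate(c_j):
                    frame[d] += coef * p[t + j][d]
        out.append(frame)
    return out


# Usage with toy data.
mixed, gain = mix_at_aar([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, 0.5, 0.5], aar_db=0.0)
print(mixed)  # [2.0, 0.0, 2.0, 0.0]
print(merge_labels(["Speech"], ["Dog", "Speech"]))  # ['Dog', 'Speech']
print(cfsmn_memory([[1.0], [2.0], [3.0]], a=[[0.5], [0.25]], c=[[1.0]]))
# [[3.5], [6.25], [5.0]]
```

In a full cFSMN layer, the memory output would then pass through an affine transform and a nonlinearity to form the next hidden layer; that step is an ordinary dense layer and is omitted. Plain lists keep the sketch dependency-free; a practical pipeline would vectorize both functions.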


Pages: 3377-3381

Page count: 5
