Evaluation of Environmental Sound Classification using Vision Transformer

Times cited: 0

Authors:

Wang, Changlong [1]

Ito, Akinori [1]

Nose, Takashi [1]

Chen, Chia-Ping [2]

Affiliations:

[1] Tohoku Univ, Sendai, Miyagi, Japan

[2] Natl Sun Yat Sen Univ, Kaohsiung, Taiwan

Source:

2024 16TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, ICMLC 2024 | 2024

Keywords:

Audio Classification; Environmental Sound Classification; ESC-50; Vision Transformer

DOI:

10.1145/3651671.3651733

Chinese Library Classification (CLC) number:

TP18 [Theory of artificial intelligence]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Recently, attention-based vision transformers have achieved significant success in audio classification tasks. However, most applications of vision transformers to audio classification focus on the final score, and detailed evaluation from a practical standpoint is still lacking. In this work, we conduct a comparative study that improves vision transformers on the ESC-50 dataset step by step. Our goal is to provide practitioners with a solid foundation for adapting vision transformers to general Environmental Sound Classification tasks. The comparison covers model settings, data augmentation, cross-domain transfer learning, and model pruning, offering practical insights for implementing vision transformers in real-world scenarios.
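The abstract does not say how the audio is turned into transformer tokens. As a rough illustration only, the sketch below assumes a ViT/AST-style front end (128 log-mel bins, a 10 ms hop, square 16×16 patches); none of these values are taken from the paper. It counts the patch tokens produced for one 5-second ESC-50 clip, since that count drives the quadratic cost of self-attention and is one reason patch size and stride matter in practice. The function `num_patches` is a hypothetical helper, not part of any library.

```python
def num_patches(n_mels: int, n_frames: int, patch: int = 16, stride: int = 16) -> int:
    """Count patch tokens for one spectrogram (hypothetical helper, assumed ViT-style tiling)."""
    if n_mels < patch or n_frames < patch:
        raise ValueError("spectrogram is smaller than one patch")
    # Number of patch positions along the frequency and time axes (no padding).
    rows = (n_mels - patch) // stride + 1
    cols = (n_frames - patch) // stride + 1
    return rows * cols


# Assumed front end: 5-second clip, 10 ms hop -> about 500 frames, 128 mel bins.
frames = 5 * 1000 // 10
print(num_patches(128, frames))             # non-overlapping 16x16 patches -> 248
print(num_patches(128, frames, stride=10))  # overlapping patches, stride 10 -> 588
```

Under these assumptions, switching from non-overlapping patches to a stride of 10 grows the sequence from 248 to 588 tokens, so the attention matrix grows by roughly 5.6 times.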

Pages: 665–669

Page count: 5

References: 28
