Self-Supervised Learning-Based General Fine-tuning Framework For Audio Classification and Event Detection

Cited by: 0
Authors
Sun, Yanjie [1]
Xu, Kele [1]
Dou, Yong [1]
Gao, Tian [2]
Affiliations
[1] National University of Defense Technology, Changsha, People's Republic of China
[2] iFlytek Research, Hefei, People's Republic of China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME 2024 | 2024
Keywords
Audio classification; Audio event detection; Semantic-aware; Self-supervised learning; Fine-tuning;
DOI
10.1109/ICME57554.2024.10687821
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, self-supervised learning (SSL) has made remarkable progress in signal representation and has become a de facto solution for a range of audio processing tasks. SSL generally consists of two phases: foundation pre-training and downstream fine-tuning. However, fine-tuning frameworks may lack universality because audio signal processing tasks employ distinct learning paradigms and model designs. Furthermore, the degree of dataset labeling varies across tasks, which makes it difficult to unify fine-tuning in a single framework. To address these issues, we propose vec2task, a cross-task general fine-tuning framework based on an SSL pre-trained model. It employs a semantic-aware module and an alternating training strategy, enabling the framework to generalize across various audio signal processing tasks. Additionally, the framework employs automatic audio augmentation strategies, eliminating the need for individually tailored algorithms to improve task performance. In experiments, vec2task outperformed previous methods on audio classification and event detection tasks, demonstrating its ability to generalize across tasks.
Pages: 6