Dual representations: A novel variant of Self-Supervised Audio Spectrogram Transformer with multi-layer feature fusion and pooling combinations for sound classification

Times cited: 0
Authors
Choi, Hyosun [1 ]
Zhang, Li [1 ]
Watkins, Chris [1 ]
Affiliations
[1] Royal Holloway Univ London, Dept Comp Sci, Egham TW20 0EX, Surrey, England
Keywords
Transformer; Embeddings; Multi-layer feature fusion; Pooling combinations
DOI
10.1016/j.neucom.2025.129415
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The Self-Supervised Audio Spectrogram Transformer (SSAST) has recently been shown to achieve state-of-the-art results on various audio and speech command classification tasks. SSAST uses self-supervised learning to reduce the need for large amounts of labelled data when pre-training transformers, removing a key disadvantage of its supervised counterpart, the Audio Spectrogram Transformer (AST). Because transformers such as SSAST use only the feature representations from the last layer for downstream classification, important information from the middle layers may be lost during training. We therefore propose a novel variant of SSAST that builds a dual representation by fusing the outputs of multiple layers (i.e. both a middle layer and the last layer) for audio classification. Specifically, we apply patch-wise pooling combinations to all patches from both a middle layer and the last layer of a pre-trained patch-based self-supervised model. This produces two pooled sequences, derived from combinations of mean, max, and min pooling, which are concatenated into a final double-sized representation. The dual representation carries more discriminative information, providing the linear multi-layer perceptron head with richer features for audio classification. Compared with existing state-of-the-art studies, the proposed model using dual representations from multi-layer feature fusion and pooling combinations significantly boosts performance on all tasks, reaching accuracies of 93.67%, 100%, 79.59%, 79.59%, 91.22%, and 85.90% on CREMA-D, TESS, RAVDESS, Speech Emotion Classification, Isolated Urban Events, and CornellBirdCall, respectively.
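To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of a dual-representation classification head: patch embeddings from a middle layer and the last layer of a transformer backbone are each pooled over all patches (here with mean and max pooling), concatenated into one double-sized vector, and passed to an MLP head. The backbone below is a toy stand-in for a pre-trained SSAST, and the layer index, pooling choices, and head sizes are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the dual-representation head described in the abstract.
# The backbone, layer index, pooling ops, and head sizes are assumptions.
import torch
import torch.nn as nn


class DualRepresentationHead(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int,
                 middle_layer: int = 6, pooling: str = "mean_max"):
        super().__init__()
        self.middle_layer = middle_layer
        self.ops = pooling.split("_")
        # Two layers, each contributing one pooled vector per pooling op.
        self.mlp = nn.Sequential(
            nn.Linear(2 * len(self.ops) * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def _pool(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, embed_dim) -> pooled over all patches
        pooled = []
        for op in self.ops:
            if op == "mean":
                pooled.append(patches.mean(dim=1))
            elif op == "max":
                pooled.append(patches.max(dim=1).values)
            elif op == "min":
                pooled.append(patches.min(dim=1).values)
        return torch.cat(pooled, dim=-1)

    def forward(self, hidden_states) -> torch.Tensor:
        # hidden_states: list of per-layer patch embeddings from the backbone.
        dual = torch.cat(
            [self._pool(hidden_states[self.middle_layer]),
             self._pool(hidden_states[-1])],
            dim=-1,
        )
        return self.mlp(dual)


if __name__ == "__main__":
    # Toy backbone standing in for a pre-trained SSAST: a 12-layer
    # transformer encoder over 100 spectrogram patches of width 768.
    embed_dim, num_layers, num_patches, num_classes = 768, 12, 100, 6
    backbone = nn.ModuleList(
        nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
        for _ in range(num_layers)
    )

    x = torch.randn(2, num_patches, embed_dim)  # patch embeddings of 2 clips
    hidden_states = []
    for block in backbone:
        x = block(x)
        hidden_states.append(x)                 # keep every layer's output

    head = DualRepresentationHead(embed_dim, num_classes)
    logits = head(hidden_states)                # shape: (2, num_classes)
    print(logits.shape)
```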
Pages: 15