Adapting Language-Audio Models as Few-Shot Audio Learners

Cited by: 8
Authors
Liang, Jinhua [1 ]
Liu, Xubo [2 ]
Liu, Haohe [2 ]
Phan, Huy [3 ]
Benetos, Emmanouil [2 ,4 ]
Plumbley, Mark D. [2 ]
Wang, Wenwu [2 ]
Affiliations
[1] Queen Mary Univ London, Ctr Digital Mus, London, England
[2] Ctr Vis Speech & Signal Proc CVSSP, London, England
[3] Amazon Alexa, London, England
[4] Alan Turing Inst, London, England
Source
INTERSPEECH 2023 | 2023
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Contrastive language-audio pretraining; few-shot learning; domain adaptation; audio classification;
DOI
10.21437/Interspeech.2023-1082
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Contrastive language-audio pretraining (CLAP) has become a new paradigm for learning audio concepts from audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular method is to fine-tune its audio encoder with available labelled examples. However, this is challenging in low-shot scenarios, as the number of annotations is small compared to the model size. In this work, we introduce a Training-efficient (Treff) adapter that rapidly learns from a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a cross-attention linear model (CALM) that maps a set of labelled examples and a test audio clip to test labels. Second, we find that initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metric-based methods in few-shot settings and yields results competitive with fully-supervised methods.
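The core idea of the abstract — cross-attention from a test embedding over a handful of labelled support embeddings, initialised as a cosine measurement — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the softmax formulation, and the sharpening factor `alpha` are assumptions for the example.

```python
import numpy as np

def calm_logits(test_emb, support_embs, support_labels, alpha=5.0):
    """Cosine-initialised cross-attention over a few labelled examples.

    test_emb:       (d,)   embedding of the query audio clip
    support_embs:   (n, d) embeddings of the n labelled (few-shot) examples
    support_labels: (n, c) one-hot labels of the support examples
    """
    # L2-normalise so that a dot product equals cosine similarity
    q = test_emb / np.linalg.norm(test_emb)
    k = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    # attention weights: sharpened cosine similarity to each support example
    attn = np.exp(alpha * (k @ q))
    attn /= attn.sum()
    # weighted vote over the support labels gives the class scores
    return attn @ support_labels
```

Because the attention is just a (sharpened) cosine similarity at initialisation, the adapter already produces sensible class scores with no training at all, which matches the abstract's observation that the cosine initialisation helps even in the training-free setting.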
Pages: 276-280
Page count: 5