Adapting Language-Audio Models as Few-Shot Audio Learners

Cited by: 8
Authors
Liang, Jinhua [1 ]
Liu, Xubo [2 ]
Liu, Haohe [2 ]
Phan, Huy [3 ]
Benetos, Emmanouil [2 ,4 ]
Plumbley, Mark D. [2 ]
Wang, Wenwu [2 ]
Affiliations
[1] Queen Mary Univ London, Ctr Digital Mus, London, England
[2] Ctr Vis Speech & Signal Proc CVSSP, London, England
[3] Amazon Alexa, London, England
[4] Alan Turing Inst, London, England
Source
INTERSPEECH 2023 | 2023
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Contrastive language-audio pretraining; few-shot learning; domain adaptation; audio classification;
DOI
10.21437/Interspeech.2023-1082
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Contrastive language-audio pretraining (CLAP) has become a new paradigm for learning audio concepts from audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular method is to fine-tune its audio encoder with available labelled examples. However, this is challenging in low-shot scenarios, as the number of annotations is small compared to the model size. In this work, we introduce a Training-efficient (Treff) adapter that rapidly learns from a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a cross-attention linear model (CALM) that maps a set of labelled examples and a test audio clip to test labels. Second, we find that initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metric-based methods in few-shot settings and yields results competitive with fully-supervised methods.
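The core idea of the abstract — cross-attention from a test embedding over a handful of labelled support embeddings, initialised as a cosine measurement — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the softmax formulation, and the sharpening factor `alpha` are assumptions for the example.

```python
import numpy as np

def calm_logits(test_emb, support_embs, support_labels, alpha=5.0):
    """Cosine-initialised cross-attention over a few labelled examples.

    test_emb:       (d,)   embedding of the query audio clip
    support_embs:   (n, d) embeddings of the n labelled (few-shot) examples
    support_labels: (n, c) one-hot labels of the support examples
    """
    # L2-normalise so that a dot product equals cosine similarity
    q = test_emb / np.linalg.norm(test_emb)
    k = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    # attention weights: sharpened cosine similarity to each support example
    attn = np.exp(alpha * (k @ q))
    attn /= attn.sum()
    # weighted vote over the support labels gives the class scores
    return attn @ support_labels
```

Because the attention is just a (sharpened) cosine similarity at initialisation, the adapter already produces sensible class scores with no training at all, which matches the abstract's observation that the cosine initialisation helps even in the training-free setting.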
Pages: 276-280
Page count: 5