Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models

Citations: 0

Authors:

Xu, Xuenan [1]

Zhang, Pingyue [1]

Yang, Ming [2]

Zhang, Ji [2]

Wu, Mengyue [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, X LANCE Lab, Shanghai, Peoples R China

[2] Alibaba Grp, Inst Intelligent Comp, Hangzhou, Peoples R China

Source:

INTERSPEECH 2024 | 2024

Funding:

National Natural Science Foundation of China;

Keywords:

zero-shot learning; audio classification; sound attribute; large language model; audio-text contrastive learning;

DOI:

10.21437/Interspeech.2024-1692

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage the domain knowledge of large language models to generate detailed attribute descriptions for each class. In contrast to previous works that primarily relied on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet. Our results demonstrate a substantial improvement in zero-shot classification accuracy. Ablation results show robust performance enhancement, regardless of the model architecture.
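
Illustrative sketch (not part of the record): the abstract describes matching audio against text descriptions of classes the model has not seen. A minimal, hedged illustration of that general idea follows, assuming audio and text have already been mapped into a shared embedding space by some contrastively trained encoders. The toy vectors, the helper names (`cosine`, `mean_vector`, `zero_shot_classify`), and the choice to average the attribute-description embeddings per class are all assumptions made for illustration; the abstract does not specify how the paper's encoders work or how descriptions are aggregated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def zero_shot_classify(audio_embedding, class_text_embeddings):
    """Pick the class whose averaged description embedding is closest to the audio.

    class_text_embeddings maps a class label to a list of embeddings, one per
    attribute description (hypothetical layout, assumed for this sketch).
    """
    scores = {
        label: cosine(audio_embedding, mean_vector(embeddings))
        for label, embeddings in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy 3-dimensional embeddings standing in for real encoder outputs.
class_text_embeddings = {
    "dog bark": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "rain": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
audio_embedding = [0.85, 0.15, 0.05]

predicted, scores = zero_shot_classify(audio_embedding, class_text_embeddings)
print(predicted, scores)
```

Because classification only needs text embeddings for the candidate labels, new classes can be added at inference time by supplying new descriptions, without retraining on audio from those classes.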

Citation

Pages: 4808-4812

Page count: 5