Audio-LLM: Activating the Capabilities of Large Language Models to Comprehend Audio Data

Cited by: 1
Authors
Tang, Dongting Chenchong [1]
Liu, Han [1]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
Source
ADVANCES IN NEURAL NETWORKS-ISNN 2024 | 2024, Vol. 14827
Funding
National Natural Science Foundation of China
Keywords
LLMs; Audio encoding; Prompt tuning; Fine-tuning alignment;
DOI
10.1007/978-981-97-4399-5_13
CLC classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
We introduce Audio-LLM (link to our work: https://github.com/orallove/audio-LLM), a large language model that improves audio question-answering (AQA) systems and activates the capabilities of large language models to comprehend audio data. We introduce an encoding method that transforms audio data into embedded representations, enabling LLMs to comprehend and process the information contained in the audio. Through a series of fine-tuning stages, we align audio with text, allowing LLMs to leverage both auditory and textual prompts. This alignment enables the model to achieve strong performance in automatic speech recognition (ASR), emotion recognition (ER), English-to-Chinese translation (En2Zh), and music captioning (MC), among other tasks, demonstrating its versatility across downstream applications. In addition, the model trains efficiently: we update only about 20 million parameters, roughly 0.27% of the entire Audio-LLM. Furthermore, the discussion highlights the model's adaptability to zero-shot tasks, positioning Audio-LLM as a significant advancement with far-reaching implications for generalized hearing AI.
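The reported efficiency figure can be sanity-checked with a few lines of arithmetic. Note the ~7.4B total-parameter count below is an inference from the abstract's stated 20M / 0.27% ratio (consistent with a frozen ~7B backbone plus a small trainable alignment module), not a number given in the record:

```python
def trainable_fraction(trainable: int, total: int) -> float:
    """Fraction of model parameters updated during fine-tuning."""
    return trainable / total

TRAINABLE = 20_000_000              # adapter/prompt parameters (from the abstract)
TOTAL = round(TRAINABLE / 0.0027)   # implied full-model size: ~7.4B (inferred)

fraction = trainable_fraction(TRAINABLE, TOTAL)
print(f"{fraction:.2%} of {TOTAL / 1e9:.1f}B parameters are trainable")
```

This matches the parameter-efficient fine-tuning pattern the keywords suggest (prompt tuning / fine-tuning alignment), where the backbone LLM stays frozen and only a lightweight audio-to-text projection is updated.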
Pages: 133-142
Page count: 10