ICKA: An instruction construction and Knowledge Alignment framework for Multimodal Named Entity Recognition

Cited by: 1
Authors
Zeng, Qingyang [1]
Yuan, Minghui [1]
Wan, Jing [1]
Wang, Kunfeng [1]
Shi, Nannan [2]
Che, Qianzi [2]
Liu, Bin [2]
Affiliations
[1] Beijing Univ Chem Technol, Beijing 100029, Peoples R China
[2] China Acad Chinese Med Sci, Inst Basic Res Clin Med, Beijing 100700, Peoples R China
Funding
Beijing Natural Science Foundation;
Keywords
Multimodal Named Entity Recognition; Multimodal learning; Semantic alignment; Visual language model; Social media; FUSION;
DOI
10.1016/j.eswa.2024.124867
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal Named Entity Recognition (MNER) aims to identify entities of predefined types in text by leveraging information from multiple modalities, most notably textual and visual information. Most efforts concentrate on improving cross-modality attention mechanisms to facilitate guidance between modalities. However, they still suffer from certain limitations: (1) it is difficult to establish a unified representation that bridges the semantic gap among different modalities; (2) mining the implicit relationships between text and image is crucial yet challenging. In this paper, we propose an Instruction Construction and Knowledge Alignment framework for MNER, named ICKA, to address these issues. Specifically, we first employ a multi-head cross-modal attention mechanism to obtain a cross-modal fusion representation by fusing features from text-image pairs. Then, we integrate external knowledge from a pre-trained vision-language model (VLM) to facilitate semantic alignment between text and image and to obtain inter-modality connections. Next, we construct a multimodal instruction that consists of the modal features and uses the inter-modality connections as a bridge between them. We then integrate the instruction into the language model to effectively incorporate multimodal knowledge. Finally, we perform sequence labeling using a Conditional Random Field (CRF) decoder with a gating mechanism. The proposed method achieves F1 scores of 75.42% on the Twitter2015 dataset and 87.12% on the Twitter2017 dataset, demonstrating its competitiveness.
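To make the fusion step described in the abstract concrete, the sketch below shows a multi-head cross-modal attention layer in which text tokens attend to image regions, followed by a gated residual combination. It is a minimal illustration under stated assumptions: the module name `CrossModalFusion`, the feature dimensions, and the sigmoid gate design are hypothetical stand-ins, not the authors' implementation of ICKA.

```python
# A minimal sketch (assumptions, not the paper's code) of cross-modal fusion:
# text tokens query visual regions via multi-head attention, then a learned
# gate controls how much visual context each token absorbs.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse token-level text features with region-level image features."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text tokens act as queries; visual regions act as keys/values.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, num_heads=num_heads, batch_first=True
        )
        # Hypothetical gate: decides, per token, how much visual evidence to keep.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, seq_len, hidden_dim)   e.g. token embeddings from a text encoder
        # image_feats: (batch, regions, hidden_dim)   e.g. projected image patch features
        visual_context, _ = self.cross_attn(text_feats, image_feats, image_feats)
        g = self.gate(torch.cat([text_feats, visual_context], dim=-1))
        # Gated residual: tokens with little visual relevance stay close to the text features.
        return text_feats + g * visual_context


if __name__ == "__main__":
    fusion = CrossModalFusion()
    text = torch.randn(2, 32, 768)    # 2 sentences, 32 tokens each
    image = torch.randn(2, 49, 768)   # 2 images, 7x7 patch grid
    print(fusion(text, image).shape)  # torch.Size([2, 32, 768])
```

The gated residual reflects the general intuition behind gating in MNER models: noisy or unrelated images should not overwrite the textual representation; the exact gating used in ICKA's CRF decoder may differ.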
Pages: 10