MPMRC-MNER: A Unified MRC framework for Multimodal Named Entity Recognition based Multimodal Prompt

Cited by: 5
Authors
Bao, Xigang [1 ]
Tian, Mengyuan [1 ]
Zha, Zhiyuan [1 ]
Qin, Biao [1 ]
Affiliations
[1] Renmin University of China, School of Information, Beijing, People's Republic of China
Source
PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
Multimodal Named Entity Recognition; Multimodal Prompt; Contrastive Learning;
DOI
10.1145/3583780.3614975
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal named entity recognition (MNER) is a vision-language task that aims to detect entity spans and classify them into the corresponding entity types given a sentence-image pair. Existing methods often regard an image as a set of visual objects and try to explicitly capture the relations between visual objects and entities. However, since visual objects rarely match entities in quantity and type, these methods may suffer from the bias introduced by visual objects rather than benefit from them. Inspired by the success of textual prompt-based fine-tuning (PF) approaches, in this paper we propose a Multimodal Prompt-based Machine Reading Comprehension framework, namely MPMRC-MNER, which implicitly aligns text and image to improve MNER. Specifically, we transform the text-only query in MRC into a multimodal prompt containing both image tokens and text tokens. To better integrate the two kinds of tokens, we design a prompt-aware attention mechanism for cross-modal fusion. Finally, we apply contrastive learning with two types of contrastive losses to learn more consistent representations of the two modalities and to reduce noise. Extensive experiments and analyses on two public MNER datasets, Twitter2015 and Twitter2017, demonstrate that our model outperforms state-of-the-art methods.
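As a rough illustration of the contrastive-alignment idea in the abstract (the paper defines its own two losses, which this record does not detail), the sketch below shows a symmetric InfoNCE-style objective that pulls paired text and image representations together while pushing apart mismatched pairs within a batch. All names, tensor shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a symmetric InfoNCE loss as one
# plausible instance of a cross-modal contrastive objective for MNER.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor,
             image_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (text, image) pairs.

    text_emb, image_emb: [batch, dim] pooled representations of the
    textual query and the image tokens for the same sentence-image pair.
    """
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature  # [batch, batch] pairwise similarities
    targets = torch.arange(t.size(0), device=t.device)
    # Matched pairs sit on the diagonal; every other sample in the batch
    # serves as a negative, in both text-to-image and image-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Usage: given pooled encoder outputs for a batch of sentence-image pairs,
# loss = info_nce(text_pooled, image_pooled)
```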
Pages: 47-56
Page count: 10