Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

Cited by: 23
Authors
Jia, Meihuizi [1 ,2 ]
Shen, Xin [3 ]
Shen, Lei [2 ]
Pang, Jinhui [1 ]
Liao, Lejian [1 ]
Song, Yang [2 ]
Chen, Meng [2 ]
He, Xiaodong [2 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] JD AI, Beijing, Peoples R China
[3] Australian Natl Univ, Canberra, ACT, Australia
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Key R&D Program of China;
Keywords
multimodal named entity recognition; machine reading comprehension; visual grounding; transfer learning;
DOI
10.1145/3503161.3548427
Chinese Library Classification (CLC) code
TP39 [Applications of Computers];
Discipline classification codes
081203 ; 0835 ;
Abstract
Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that only obtain implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and simulate text-image/inner-text interactions respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art models on Twitter2017, and yields competitive results on Twitter2015.
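Illustrative sketch (not part of the original record): to make the MRC formulation described in the abstract concrete, the following minimal Python example shows, under assumed query wordings and a simple decoding heuristic, how each entity type can be phrased as a natural-language query so that recognition reduces to extracting answer spans. The query texts, function names, and decoding rule are illustrative assumptions, not the authors' released MRC-MNER code, and the visual grounding and modal interaction stages are omitted.

from typing import List, Tuple

# Hypothetical type-to-query mapping; the exact query wording in MRC-MNER
# is a design choice of the paper and is not reproduced here.
TYPE_QUERIES = {
    "PER": "find person entities such as names of people",
    "LOC": "find location entities such as countries and cities",
    "ORG": "find organization entities such as companies and institutions",
    "MISC": "find other named entities",
}

def build_mrc_input(entity_type: str, sentence_tokens: List[str]) -> List[str]:
    # BERT-style pairing of query and sentence: [CLS] query [SEP] sentence [SEP].
    query_tokens = TYPE_QUERIES[entity_type].split()
    return ["[CLS]", *query_tokens, "[SEP]", *sentence_tokens, "[SEP]"]

def decode_spans(start_probs: List[float], end_probs: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    # Greedy span decoding: pair each predicted start with the nearest
    # predicted end at or after it (one common MRC decoding heuristic).
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        for e in ends:
            if e >= s:
                spans.append((s, e))
                break
    return spans

if __name__ == "__main__":
    sentence = "Kevin Durant joins the Golden State Warriors".split()
    print(build_mrc_input("PER", sentence))
    # Toy start/end probabilities over the 7 sentence tokens; yields span (0, 1),
    # i.e. "Kevin Durant" as a PER entity.
    print(decode_spans([0.9, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0],
                       [0.1, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0]))

In the full model, one such query is issued per entity type, and the query tokens give the text encoder explicit prior information about which type of span (and, via grounded image regions, which visual evidence) to look for.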
Pages: 3549-3558
Page count: 10