Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

Cited by: 23
Authors
Jia, Meihuizi [1 ,2 ]
Shen, Xin [3 ]
Shen, Lei [2 ]
Pang, Jinhui [1 ]
Liao, Lejian [1 ]
Song, Yang [2 ]
Chen, Meng [2 ]
He, Xiaodong [2 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] JD AI, Beijing, Peoples R China
[3] Australian Natl Univ, Canberra, ACT, Australia
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Key R&D Program of China;
Keywords
multimodal named entity recognition; machine reading comprehension; visual grounding; transfer learning;
DOI
10.1145/3503161.3548427
Chinese Library Classification (CLC) code
TP39 [Applications of Computers];
Discipline classification codes
081203 ; 0835 ;
Abstract
Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that only obtain implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and simulate text-image/inner-text interactions respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art models on Twitter2017, and yields competitive results on Twitter2015.
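Illustrative sketch (not part of the original record): to make the MRC formulation described in the abstract concrete, the following minimal Python example shows, under assumed query wordings and a simple decoding heuristic, how each entity type can be phrased as a natural-language query so that recognition reduces to extracting answer spans. The query texts, function names, and decoding rule are illustrative assumptions, not the authors' released MRC-MNER code, and the visual grounding and modal interaction stages are omitted.

from typing import List, Tuple

# Hypothetical type-to-query mapping; the exact query wording in MRC-MNER
# is a design choice of the paper and is not reproduced here.
TYPE_QUERIES = {
    "PER": "find person entities such as names of people",
    "LOC": "find location entities such as countries and cities",
    "ORG": "find organization entities such as companies and institutions",
    "MISC": "find other named entities",
}

def build_mrc_input(entity_type: str, sentence_tokens: List[str]) -> List[str]:
    # BERT-style pairing of query and sentence: [CLS] query [SEP] sentence [SEP].
    query_tokens = TYPE_QUERIES[entity_type].split()
    return ["[CLS]", *query_tokens, "[SEP]", *sentence_tokens, "[SEP]"]

def decode_spans(start_probs: List[float], end_probs: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    # Greedy span decoding: pair each predicted start with the nearest
    # predicted end at or after it (one common MRC decoding heuristic).
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        for e in ends:
            if e >= s:
                spans.append((s, e))
                break
    return spans

if __name__ == "__main__":
    sentence = "Kevin Durant joins the Golden State Warriors".split()
    print(build_mrc_input("PER", sentence))
    # Toy start/end probabilities over the 7 sentence tokens; yields span (0, 1),
    # i.e. "Kevin Durant" as a PER entity.
    print(decode_spans([0.9, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0],
                       [0.1, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0]))

In the full model, one such query is issued per entity type, and the query tokens give the text encoder explicit prior information about which type of span (and, via grounded image regions, which visual evidence) to look for.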
Pages: 3549-3558
Page count: 10