End-to-End Visual Grounding Framework for Multimodal NER in Social Media Posts

Cited by: 0
Authors
Lyu, Yifan [1 ]
Hu, Jiapei [1 ]
Xue, Yun [1 ]
Cai, Qianhua [1 ]
Affiliations
[1] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
Source
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS | 2024
Funding
National Natural Science Foundation of China;
Keywords
Contrastive learning; multimodal named entity recognition (MNER); visual grounding (VG);
DOI
10.1109/TCSS.2024.3402738
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Multimodal named entity recognition (MNER) for social media aims to detect named entities in user-generated posts with the aid of visual information from attached images. Existing methods use pretrained visual models or visual grounding (VG) toolkits to learn visual information. However, they still suffer from the mismatch issue, where the visual features extracted by the visual encoder are inconsistent with the actual requirements of cross-modal interaction. Ideally, the visual encoder should actively extract visual information guided by the text, which inherently provides the blueprint of the desired visual features. In this article, we present an end-to-end VG framework for the MNER task (VG-MNER), which adaptively learns text-related visual features. Specifically, we introduce a backbone network with a feature fusion module to learn and aggregate multisize visual representations. We then develop a text-related visual attention module to refine the visual features. Notably, an entity-image contrastive loss is designed to guide the training of the visual encoder. The proposed model outperforms several state-of-the-art methods, achieving F1 scores of 75.62% and 88.11% on two benchmark datasets. Experimental results reveal the effectiveness of leveraging text-related visual information in the MNER task.
Pages: 7223-7233
Number of pages: 11
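For intuition only, since this record contains no code from the authors, the following is a minimal PyTorch-style sketch of the two mechanisms the abstract names: a text-guided attention that refines visual region features using text tokens as queries, and a symmetric InfoNCE-style entity-image contrastive loss. All class, function, and variable names, dimensions, and the temperature value are hypothetical placeholders, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedVisualAttention(nn.Module):
    """Refines visual region features using text tokens as attention queries."""

    def __init__(self, text_dim: int = 768, visual_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, hidden_dim)    # queries come from the text
        self.k_proj = nn.Linear(visual_dim, hidden_dim)  # keys come from visual regions
        self.v_proj = nn.Linear(visual_dim, hidden_dim)  # values come from visual regions

    def forward(self, text_feats, visual_feats):
        # text_feats:   (batch, n_tokens, text_dim),    e.g., BERT token embeddings
        # visual_feats: (batch, n_regions, visual_dim), e.g., flattened backbone feature map
        q = self.q_proj(text_feats)
        k = self.k_proj(visual_feats)
        v = self.v_proj(visual_feats)
        scores = q @ k.transpose(1, 2) / (k.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        return attn @ v  # (batch, n_tokens, hidden_dim): text-related visual features


def entity_image_contrastive_loss(text_emb, image_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss pulling matched text/image pairs together within a batch."""
    text_emb = F.normalize(text_emb, dim=-1)    # (batch, dim)
    image_emb = F.normalize(image_emb, dim=-1)  # (batch, dim)
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Toy usage with random tensors standing in for real encoder outputs.
attn = TextGuidedVisualAttention()
text = torch.randn(2, 20, 768)     # 2 posts, 20 text tokens each
regions = torch.randn(2, 49, 512)  # 2 images, 7x7 = 49 regions each
fused = attn(text, regions)        # (2, 20, 256)
img_global = torch.randn(2, 256)   # stand-in for a pooled, projected image embedding
loss = entity_image_contrastive_loss(fused.mean(dim=1), img_global)
print(fused.shape, loss.item())

Using text tokens as the attention queries is what makes the extracted visual features text-dependent, which is the mismatch the abstract targets; a batch-wise contrastive loss is one common way to supervise such cross-modal alignment.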