Multi-modal mask Transformer network for social event classification

Cited by: 0
Authors
Chen H. [1 ]
Qian S. [2 ]
Li Z. [2 ]
Fang Q. [2 ]
Xu C. [2 ]
Affiliations
[1] Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou
[2] Institute of Automation, Chinese Academy of Sciences, Beijing
Source
Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics | 2024, Vol. 50, No. 02
Funding
National Natural Science Foundation of China;
Keywords
crisis event classification; multi-modal; multi-modal mask Transformer network; representation learning; social media;
DOI
10.13700/j.bh.1001-5965.2022.0388
Abstract
Fully exploiting the properties of both the text and image modalities is essential for multi-modal social event classification. However, most existing methods simply concatenate the image and text features of an event, so irrelevant contextual information shared between the modalities causes mutual interference. It is therefore not enough to model only the relationships between the modalities of multi-modal data; irrelevant context between modalities (such as image regions or words) must also be accounted for. To overcome these limitations, this paper proposes a novel social event classification method based on a multi-modal mask Transformer network (MMTN). Specifically, the method learns better representations of text and images through an image-text encoding network. To fuse the multi-modal data, the resulting image and word representations are fed into a multi-modal mask Transformer network: by computing the similarity between the multi-modal features, it models the relationships between modalities and masks out irrelevant inter-modal context. Extensive experiments on two benchmark datasets demonstrate that the proposed model achieves state-of-the-art performance. © 2024 Beijing University of Aeronautics and Astronautics (BUAA). All rights reserved.
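The masking step described in the abstract (computing cross-modal similarity and suppressing irrelevant token-region pairs before attention) can be sketched as follows. This is a minimal, hypothetical PyTorch rendering under assumed shapes; the function name, the cosine-similarity choice, and the threshold tau are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of similarity-based cross-modal masking, assuming
    # cosine similarity and a fixed threshold (both hypothetical choices).
    import torch
    import torch.nn.functional as F

    def masked_cross_modal_attention(text_feats, image_feats, tau=0.1):
        """Attend text tokens to image regions, masking irrelevant pairs.

        text_feats:  (B, T, D) token representations (e.g., from BERT)
        image_feats: (B, R, D) region representations (e.g., from ResNet)
        tau:         similarity threshold below which a pair is masked
                     (hypothetical hyper-parameter).
        """
        # Cosine similarity between every token and every region: (B, T, R)
        sim = torch.einsum("btd,brd->btr",
                           F.normalize(text_feats, dim=-1),
                           F.normalize(image_feats, dim=-1))

        # Scaled dot-product attention logits: (B, T, R)
        d = text_feats.size(-1)
        logits = torch.einsum("btd,brd->btr", text_feats, image_feats) / d ** 0.5

        # Mask token-region pairs whose similarity falls below the threshold,
        # so irrelevant cross-modal context cannot contribute to attention.
        logits = logits.masked_fill(sim < tau, float("-inf"))
        attn = torch.softmax(logits, dim=-1)

        # Rows where every region was masked produce NaN after softmax;
        # zero them out so those tokens receive no image context.
        attn = torch.nan_to_num(attn, nan=0.0)

        # Aggregate image context into each text token: (B, T, D)
        return torch.bmm(attn, image_feats)

    # Usage with dummy inputs: 2 posts, 12 tokens, 36 regions, 256-d features.
    text = torch.randn(2, 12, 256)
    image = torch.randn(2, 36, 256)
    out = masked_cross_modal_attention(text, image)  # (2, 12, 256)

The same masking could equally be applied in the image-to-text direction; the sketch shows one direction only to keep the idea compact.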
Pages: 579-587
Number of pages: 8