Transformer-based models for multimodal irony detection

Cited by: 16
Authors
Tomás D. [1]
Ortega-Bueno R. [2]
Zhang G. [3]
Rosso P. [2]
Schifanella R. [4]
Affiliations
[1] Department of Software and Computing Systems, University of Alicante, Alicante
[2] PRHLT Research Center, Universitat Politècnica de València, Valencia
[3] School of Information Management, Wuhan University, Wuhan
[4] Applied Research on Computational Complex Systems, University of Turin, Turin
Keywords
Image text fusion; Irony detection; Multimodality; Transformer
DOI
10.1007/s12652-022-04447-y
Abstract
Irony is nowadays a pervasive phenomenon in social networks. The multimodal functionalities of these platforms (i.e., the possibility of attaching audio, video, and images to textual information) are increasingly leading their users to combine information in different formats to express ironic thoughts. The present work studies irony detection in social media posts involving image and text. To this end, a transformer architecture for the fusion of textual and image information is proposed. The model leverages disentangled text attention with visual transformers, improving the F1-score by up to 9% over previous works in the field and over current state-of-the-art visio-linguistic transformers. The proposed architecture was evaluated on three different multimodal datasets gathered from Twitter and Tumblr. The results revealed that, in many situations, the text-only version of the architecture was able to capture the ironic nature of the message without using visual information. This phenomenon was further analysed, leading to the identification of linguistic patterns that could provide the context necessary for irony detection without additional visual information. © 2022, The Author(s).
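The abstract does not detail the fusion mechanism. As a rough, hypothetical illustration only, the sketch shows the simplest form of image-text fusion: concatenating a text embedding with an image embedding and scoring the joint vector with a logistic head, together with a binary F1-score on the ironic class. Concatenation, the function names `fuse`, `irony_probability` and `f1_score`, and the toy vectors are all assumptions; the paper's model combines disentangled text attention with visual transformers and is not reproduced here, and it may report a differently averaged F1.

```python
import math
from typing import List, Sequence


def fuse(text_emb: Sequence[float], image_emb: Sequence[float]) -> List[float]:
    """Hypothetical fusion step: concatenate the text and image vectors."""
    return list(text_emb) + list(image_emb)


def irony_probability(fused: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical logistic head: sigmoid(w . x + b) over the fused vector."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused dimension")
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 for the ironic (label 1) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy stand-ins for pooled encoder outputs (assumed, not real model features).
    text_vec = [0.2, -0.1, 0.7]
    image_vec = [0.5, 0.3]
    fused = fuse(text_vec, image_vec)
    print(irony_probability(fused, [1.0, 0.0, 1.0, 0.5, -0.5], bias=-0.5))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

A text-only variant, analogous to the text-only setting discussed in the abstract, would skip `fuse` and score `text_vec` directly with a weight vector of matching length.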
Pages: 7399–7410
Number of pages: 12