Modeling Multi-Task Joint Training of Aggregate Networks for Multi-Modal Sarcasm Detection

Cited by: 2
Authors
Ou, Lisong [1 ,2 ]
Li, Zhixin [1 ,2 ]
Affiliations
[1] Guangxi Normal Univ, Minist Educ, Key Lab Educ Blockchain & Intelligent Technol, Guilin, Guangxi, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin, Guangxi, Peoples R China
Source
PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024 | 2024
Funding
National Natural Science Foundation of China;
Keywords
multi-modal sarcasm detection; aggregation network; multi-task CLIP framework; cross-modal interaction; SENTIMENT;
DOI
10.1145/3652583.3658015
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the continued rise of social media, where people routinely express their emotions, the multi-modal sarcasm detection (MSD) task has attracted increasing attention. However, owing to the unique nature of sarcasm itself, two main challenges remain on the way to robust MSD: 1) existing mainstream methods often fail to account for weak correlation between modalities, thereby ignoring important sarcasm cues within each individual modality; and 2) they model cross-modal interactions inefficiently on unaligned multi-modal data. This paper therefore proposes a multi-task jointly trained aggregation network (MTAN), which adopts networks tailored to each modality according to its processing task. Specifically, we design a multi-task CLIP framework comprising a uni-modal text task, a uni-modal image task, and a cross-modal interaction task, which exploits sentiment cues from multiple tasks for multi-modal sarcasm detection. In addition, we design a global-local cross-modal interaction learning method that uses the discourse-level representation of each modality as a global multi-modal context interacting with local uni-modal features. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also lets the global multi-modal context and local uni-modal features reinforce each other and improve progressively through multi-layer stacking. Extensive experiments and in-depth analysis show that our model achieves state-of-the-art performance on multi-modal sarcasm detection.
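Illustrative sketch (not from the paper): this record describes the MTAN architecture only in prose, so the following minimal PyTorch-style code is an assumption-laden reconstruction of the global-local cross-modal interaction idea from the abstract. All names, shapes, and hyper-parameters (GlobalLocalInteraction, dim=512, 8 heads, the CLIP-style features) are hypothetical. The point it demonstrates: a single discourse-level (global) vector queries the other modality's N local features, so attention cost per layer is O(N) rather than the O(N^2) of local-local interaction.

import torch
import torch.nn as nn

class GlobalLocalInteraction(nn.Module):
    # Hypothetical layer: one global query attends over local features.
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_global = nn.LayerNorm(dim)
        self.norm_local = nn.LayerNorm(dim)

    def forward(self, global_ctx, local_feats):
        # global_ctx:  (B, 1, D) discourse-level vector (e.g. a CLIP [CLS] token)
        # local_feats: (B, N, D) local token/patch features of the other modality
        # One query against N keys/values -> cost linear in N.
        attended, _ = self.cross_attn(global_ctx, local_feats, local_feats)
        global_ctx = self.norm_global(global_ctx + attended)
        # Broadcast the refined global context back onto the local features so
        # that stacking layers lets both representations improve progressively.
        local_feats = self.norm_local(local_feats + global_ctx)
        return global_ctx, local_feats

# Usage with arbitrary shapes: a text-side global vector interacting with
# 7x7 image patch features.
layer = GlobalLocalInteraction()
text_global = torch.randn(2, 1, 512)
image_local = torch.randn(2, 49, 512)
g, l = layer(text_global, image_local)
print(g.shape, l.shape)  # torch.Size([2, 1, 512]) torch.Size([2, 49, 512])

Stacked over several layers, and paired with separate text, image, and cross-modal task heads trained jointly, this is one plausible reading of the "mutually reinforcing and progressively improved through multi-layer stacking" description; the authors' actual implementation may differ.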
Pages: 833-841
Number of pages: 9