Visual-Textual Hybrid Sequence Matching for Joint Reasoning

Cited by: 12
Authors
Huang, Xin [1 ]
Peng, Yuxin [1 ]
Wen, Zhang [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Task analysis; Correlation; Media; Feature extraction; Visualization; Image recognition; Entailment recognition; hybrid sequence matching; knowledge transfer; visual-textual reasoning; Domain adaptation
DOI
10.1109/TCYB.2019.2956975
Chinese Library Classification (CLC) code
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Reasoning is one of the central topics in artificial intelligence. As an important reasoning paradigm, entailment recognition, which judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. In fact, human knowledge and inference span different channels such as language and vision, each offering a unique perspective with complementary reasoning cues. It is therefore important to extend entailment recognition to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach, which reasons from image-text premises to text hypotheses. Its contributions are: 1) visual-textual hybrid multicontext inference, which addresses RCE by matching with hybrid context embeddings and applies adaptive gated aggregation to obtain the final prediction, fully exploiting the interaction of complementary visual-textual cues during joint reasoning; 2) memory attention-based context embedding, which sequentially encodes hybrid context embeddings with memory attention networks that compare neighboring time-steps, capturing the important memory dimensions through coefficient assignment and fully exploiting visual-textual context correlation; and 3) a cross-task and visual-textual transfer strategy, which enriches correlation training information to boost reasoning accuracy by transferring knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results on the visual-textual entailment recognition task with the SNLI dataset verify the effectiveness of VHSM.
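The two mechanisms highlighted in the abstract, memory attention that compares neighboring time-steps and adaptive gated aggregation of per-context predictions, can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch-style rendering rather than the authors' implementation: the module names (MemoryAttentionEncoder, GatedAggregator), the LSTM-based encoder, and all dimensions are assumptions, since the record does not give the exact formulation.

```python
# Illustrative sketch only (not the VHSM authors' code). Assumes an LSTM encoder
# whose memory states at neighboring time-steps are compared by a small attention
# network to weight memory dimensions, followed by gated aggregation of
# per-context entailment predictions. All names are hypothetical.
import torch
import torch.nn as nn


class MemoryAttentionEncoder(nn.Module):
    """Sequentially encodes context embeddings; re-weights memory dimensions
    by comparing the memory states of neighboring time-steps."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        # Attention net sees [c_t ; c_{t-1}] and emits one coefficient per memory dim.
        self.attn = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, input_dim)
        batch, steps, _ = seq.shape
        h = seq.new_zeros(batch, self.cell.hidden_size)
        c = seq.new_zeros(batch, self.cell.hidden_size)
        c_prev = c
        for t in range(steps):
            h, c = self.cell(seq[:, t], (h, c))
            # Compare neighboring memory states and softly gate each dimension.
            alpha = self.attn(torch.cat([c, c_prev], dim=-1))
            c = alpha * c + (1.0 - alpha) * c_prev
            c_prev = c
        return h  # final hybrid context embedding


class GatedAggregator(nn.Module):
    """Adaptively fuses per-context embeddings with learned gates to
    produce the final entailment prediction."""

    def __init__(self, hidden_dim: int, num_contexts: int, num_classes: int = 3):
        super().__init__()
        self.gate = nn.Linear(hidden_dim * num_contexts, num_contexts)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, contexts: list) -> torch.Tensor:
        # contexts: list of (batch, hidden_dim) embeddings, one per context type
        stacked = torch.stack(contexts, dim=1)                 # (batch, K, hidden)
        weights = torch.softmax(self.gate(torch.cat(contexts, dim=-1)), dim=-1)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, hidden)
        return self.classify(fused)                            # entailment logits


if __name__ == "__main__":
    enc = MemoryAttentionEncoder(input_dim=300, hidden_dim=128)
    agg = GatedAggregator(hidden_dim=128, num_contexts=2)
    text_ctx = enc(torch.randn(4, 20, 300))    # text-premise context embedding
    image_ctx = enc(torch.randn(4, 20, 300))   # image-premise context embedding
    print(agg([text_ctx, image_ctx]).shape)    # -> torch.Size([4, 3])
```

In this sketch the gate weights decide, per example, how much the text-derived and image-derived contexts each contribute to the final prediction, which is one plausible reading of "adaptive gated aggregation" in the abstract.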
Pages: 5692-5705
Page count: 14
Related papers
15 in total
  • [1] A Simple Visual-Textual Baseline for Pedestrian Attribute Recognition
    Cheng, Xinhua
    Jia, Mengxi
    Wang, Qian
    Zhang, Jian
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) : 6994 - 7004
  • [2] Visual-Textual Attribute Learning for Class-Incremental Facial Expression Recognition
    Lv, Yuanling
    Huang, Guangyu
    Yan, Yan
    Xue, Jing-Hao
    Chen, Si
    Wang, Hanzi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8038 - 8051
  • [3] Advancing Visible-Infrared Person Re-Identification: Synergizing Visual-Textual Reasoning and Cross-Modal Feature Alignment
    Qiu, Yuxuan
    Wang, Liyang
    Song, Wei
    Liu, Jiawei
    Shi, Zhiping
    Jiang, Na
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 2184 - 2196
  • [4] Sentiment Recognition for Short Annotated GIFs Using Visual-Textual Fusion
    Liu, Tianliang
    Wan, Junwei
    Dai, Xiubin
    Liu, Feng
    You, Quanzeng
    Luo, Jiebo
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (04) : 1098 - 1110
  • [5] Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment
    Peng, Yuxin
    Ye, Zhaoda
    Qi, Jinwei
    Zhuo, Yunkan
    IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (05) : 3669 - 3683
  • [6] Visual-Textual Sentiment Analysis Enhanced by Hierarchical Cross-Modality Interaction
    Zhou, Tao
    Cao, Jiuxin
    Zhu, Xuelin
    Liu, Bo
    Li, Shancang
    IEEE SYSTEMS JOURNAL, 2021, 15 (03) : 4303 - 4314
  • [7] MAVA: Multi-Level Adaptive Visual-Textual Alignment by Cross-Media Bi-Attention Mechanism
    Peng, Yuxin
    Qi, Jinwei
    Zhuo, Yunkan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 2728 - 2741
  • [8] Hybrid Graph Reasoning With Dynamic Interaction for Visual Dialog
    Du, Shanshan
    Wang, Hanli
    Li, Tengpeng
    Chen, Chang Wen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9095 - 9108
  • [9] Zoom-and-Reasoning: Joint Foreground Zoom and Visual-Semantic Reasoning Detection Network for Aerial Images
    Ge, Zuhao
    Qi, Lizhe
    Wang, Yuzheng
    Sun, Yunquan
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2572 - 2576
  • [10] A Mutually Textual and Visual Refinement Network for Image-Text Matching
    Pang, Shanmin
    Zeng, Yueyang
    Zhao, Jiawei
    Xue, Jianru
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 7555 - 7566