Visual-Textual Hybrid Sequence Matching for Joint Reasoning

Cited by: 12
Authors
Huang, Xin [1 ]
Peng, Yuxin [1 ]
Wen, Zhang [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Task analysis; Correlation; Media; Feature extraction; Visualization; Image recognition; Entailment recognition; hybrid sequence matching; knowledge transfer; visual-textual reasoning; Domain adaptation
DOI
10.1109/TCYB.2019.2956975
Chinese Library Classification (CLC) code
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Reasoning is one of the central topics in artificial intelligence. As an important reasoning paradigm, entailment recognition, which judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. In fact, human knowledge and inference span different channels such as language and vision, each offering a unique perspective with complementary reasoning cues. It is therefore important to extend entailment recognition to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach, which reasons from image-text premises to text hypotheses. Its contributions are: 1) visual-textual hybrid multicontext inference, which addresses RCE by matching with hybrid context embeddings and applies adaptive gated aggregation to obtain the final prediction, fully exploiting the interaction of complementary visual-textual cues during joint reasoning; 2) memory attention-based context embedding, which sequentially encodes hybrid context embeddings with memory attention networks that compare neighboring time-steps, capturing the important memory dimensions through coefficient assignment and fully exploiting visual-textual context correlation; and 3) a cross-task and visual-textual transfer strategy, which enriches correlation training information to boost reasoning accuracy by transferring knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results on the visual-textual entailment recognition task with the SNLI dataset verify the effectiveness of VHSM.
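The two mechanisms highlighted in the abstract, memory attention that compares neighboring time-steps and adaptive gated aggregation of per-context predictions, can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch-style rendering rather than the authors' implementation: the module names (MemoryAttentionEncoder, GatedAggregator), the LSTM-based encoder, and all dimensions are assumptions, since the record does not give the exact formulation.

```python
# Illustrative sketch only (not the VHSM authors' code). Assumes an LSTM encoder
# whose memory states at neighboring time-steps are compared by a small attention
# network to weight memory dimensions, followed by gated aggregation of
# per-context entailment predictions. All names are hypothetical.
import torch
import torch.nn as nn


class MemoryAttentionEncoder(nn.Module):
    """Sequentially encodes context embeddings; re-weights memory dimensions
    by comparing the memory states of neighboring time-steps."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        # Attention net sees [c_t ; c_{t-1}] and emits one coefficient per memory dim.
        self.attn = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, input_dim)
        batch, steps, _ = seq.shape
        h = seq.new_zeros(batch, self.cell.hidden_size)
        c = seq.new_zeros(batch, self.cell.hidden_size)
        c_prev = c
        for t in range(steps):
            h, c = self.cell(seq[:, t], (h, c))
            # Compare neighboring memory states and softly gate each dimension.
            alpha = self.attn(torch.cat([c, c_prev], dim=-1))
            c = alpha * c + (1.0 - alpha) * c_prev
            c_prev = c
        return h  # final hybrid context embedding


class GatedAggregator(nn.Module):
    """Adaptively fuses per-context embeddings with learned gates to
    produce the final entailment prediction."""

    def __init__(self, hidden_dim: int, num_contexts: int, num_classes: int = 3):
        super().__init__()
        self.gate = nn.Linear(hidden_dim * num_contexts, num_contexts)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, contexts: list) -> torch.Tensor:
        # contexts: list of (batch, hidden_dim) embeddings, one per context type
        stacked = torch.stack(contexts, dim=1)                 # (batch, K, hidden)
        weights = torch.softmax(self.gate(torch.cat(contexts, dim=-1)), dim=-1)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, hidden)
        return self.classify(fused)                            # entailment logits


if __name__ == "__main__":
    enc = MemoryAttentionEncoder(input_dim=300, hidden_dim=128)
    agg = GatedAggregator(hidden_dim=128, num_contexts=2)
    text_ctx = enc(torch.randn(4, 20, 300))    # text-premise context embedding
    image_ctx = enc(torch.randn(4, 20, 300))   # image-premise context embedding
    print(agg([text_ctx, image_ctx]).shape)    # -> torch.Size([4, 3])
```

In this sketch the gate weights decide, per example, how much the text-derived and image-derived contexts each contribute to the final prediction, which is one plausible reading of "adaptive gated aggregation" in the abstract.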
Pages: 5692-5705
Page count: 14
Related papers
15 in total
  • [1] A Simple Visual-Textual Baseline for Pedestrian Attribute Recognition
    Cheng, Xinhua
    Jia, Mengxi
    Wang, Qian
    Zhang, Jian
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) : 6994 - 7004
  • [2] Visual-Textual Attribute Learning for Class-Incremental Facial Expression Recognition
    Lv, Yuanling
    Huang, Guangyu
    Yan, Yan
    Xue, Jing-Hao
    Chen, Si
    Wang, Hanzi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8038 - 8051
  • [3] Advancing Visible-Infrared Person Re-Identification: Synergizing Visual-Textual Reasoning and Cross-Modal Feature Alignment
    Qiu, Yuxuan
    Wang, Liyang
    Song, Wei
    Liu, Jiawei
    Shi, Zhiping
    Jiang, Na
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 2184 - 2196
  • [4] Sentiment Recognition for Short Annotated GIFs Using Visual-Textual Fusion
    Liu, Tianliang
    Wan, Junwei
    Dai, Xiubin
    Liu, Feng
    You, Quanzeng
    Luo, Jiebo
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (04) : 1098 - 1110
  • [5] Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment
    Peng, Yuxin
    Ye, Zhaoda
    Qi, Jinwei
    Zhuo, Yunkan
    IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (05) : 3669 - 3683
  • [6] Visual-Textual Sentiment Analysis Enhanced by Hierarchical Cross-Modality Interaction
    Zhou, Tao
    Cao, Jiuxin
    Zhu, Xuelin
    Liu, Bo
    Li, Shancang
    IEEE SYSTEMS JOURNAL, 2021, 15 (03) : 4303 - 4314
  • [7] MAVA: Multi-Level Adaptive Visual-Textual Alignment by Cross-Media Bi-Attention Mechanism
    Peng, Yuxin
    Qi, Jinwei
    Zhuo, Yunkan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 2728 - 2741
  • [8] Hybrid Graph Reasoning With Dynamic Interaction for Visual Dialog
    Du, Shanshan
    Wang, Hanli
    Li, Tengpeng
    Chen, Chang Wen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9095 - 9108
  • [9] Zoom-and-Reasoning: Joint Foreground Zoom and Visual-Semantic Reasoning Detection Network for Aerial Images
    Ge, Zuhao
    Qi, Lizhe
    Wang, Yuzheng
    Sun, Yunquan
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2572 - 2576
  • [10] A Mutually Textual and Visual Refinement Network for Image-Text Matching
    Pang, Shanmin
    Zeng, Yueyang
    Zhao, Jiawei
    Xue, Jianru
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 7555 - 7566