Vision talks: Visual relationship-enhanced transformer for video-guided machine translation

Cited by: 1
Authors
Chen, Shiyu [1]
Zeng, Yawen [1]
Cao, Da [1]
Lu, Shaofei [1]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Machine translation; Visual relationship; Transformer; Graph convolutional network
DOI
10.1016/j.eswa.2022.118264
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video-guided machine translation is a promising task that aims to translate a source-language description into a target language, using video information as supplementary context. Most existing work uses the whole video as auxiliary information to improve translation performance. However, visual information, as a modality heterogeneous to text, can introduce noise instead. To this end, we propose a novel visual relationship-enhanced transformer that constructs a semantic-visual relational graph as a cross-modal bridge. Specifically, the visual information is treated as a structured conceptual representation, which builds a bridge between the two modalities. A graph convolutional network is then deployed to capture the relationships among visual semantics. In this way, a transformer with a structured multi-modal fusion strategy can explore the correlations. Finally, the proposed framework is optimized under a Kullback-Leibler divergence objective with label smoothing. Extensive experiments demonstrate the rationality and effectiveness of the proposed method compared with other state-of-the-art solutions.
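As an illustration of the training objective named in the abstract, the following is a minimal sketch of a Kullback-Leibler divergence loss against a label-smoothed target for a single output token. The abstract does not give the exact formulation, so the function names, the smoothing value eps=0.1, and the choice to spread the smoothing mass uniformly over the non-gold tokens are all assumptions made for this example, not the authors' implementation.

```python
import math

def smoothed_targets(gold, vocab_size, eps=0.1):
    """Hypothetical helper: put 1 - eps on the gold token and spread eps
    uniformly over the remaining vocab_size - 1 tokens (an assumption)."""
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == gold else off for i in range(vocab_size)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_label_smoothing_loss(logits, gold, eps=0.1):
    """KL(q || p), where q is the smoothed target and p = softmax(logits)."""
    logp = log_softmax(logits)
    q = smoothed_targets(gold, len(logits), eps)
    # Terms with q_i == 0 contribute nothing and are skipped to avoid log(0).
    return sum(qi * (math.log(qi) - lpi) for qi, lpi in zip(q, logp) if qi > 0.0)
```

Under these assumptions the loss is zero only when the predicted distribution equals the smoothed target, so a model is penalized for becoming fully confident in the gold token. In a sequence model the per-token values would typically be summed or averaged over target positions.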
Pages: 11