Hierarchical Local-Global Transformer for Temporal Sentence Grounding

Cited: 5
Authors
Fang, Xiang [1 ]
Liu, Daizong [2 ]
Zhou, Pan [1 ]
Xu, Zichuan [3 ]
Li, Ruixuan [4 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Hubei Engn Res Ctr Big Data Secur, Wuhan, Peoples R China
[2] Peking Univ, Wangxuan Inst Comp Technol, Beijing 100080, Peoples R China
[3] Dalian Univ Technol, Sch Software, Dalian 116024, Peoples R China
[4] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Semantics; Visualization; Grounding; Task analysis; Feature extraction; Decoding; Multi-modal representations; multimedia understanding; temporal sentence grounding; temporal transformer; NETWORK;
DOI
10.1109/TMM.2023.3309551
CLC number
TP [Automation and Computer Technology]
Subject classification code
0812
Abstract
This article studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately locate the specific segment in an untrimmed video that corresponds to a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, transformer-based approaches have been proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods achieve competitive performance, they treat video frames and query words uniformly as transformer inputs for correlation, failing to capture their different levels of granularity and distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy and models the interactions between different levels of granularity and different modalities to learn more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases to learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets (ActivityNet Captions, Charades-STA, and TACoS) show that the proposed HLGT achieves new state-of-the-art performance, demonstrating its effectiveness and computational efficiency.
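The cross-modal cycle-consistency idea in the abstract can be illustrated with a simplified, hard-assignment sketch: each video clip is matched to its closest query phrase, that phrase is matched back to its closest clip, and the cycle should return to the starting clip. This is a minimal pure-Python illustration only; the function names, dot-product similarity, and 0/1 failure penalty are assumptions for exposition, not the paper's actual (differentiable) loss formulation.

```python
def nearest(idx, src, dst):
    """Index of the dst vector most similar (by dot product) to src[idx]."""
    sims = [sum(a * b for a, b in zip(src[idx], d)) for d in dst]
    return max(range(len(dst)), key=sims.__getitem__)

def cycle_consistency_loss(video_feats, query_feats):
    """Fraction of clips whose clip -> phrase -> clip cycle fails to return.

    video_feats / query_feats: lists of equal-length feature vectors
    (illustrative stand-ins for the encoded clip and phrase features).
    """
    failures = 0
    for i in range(len(video_feats)):
        j = nearest(i, video_feats, query_feats)   # clip i -> closest phrase j
        k = nearest(j, query_feats, video_feats)   # phrase j -> closest clip k
        failures += int(k != i)                    # cycle should land back on i
    return failures / len(video_feats)
```

With perfectly aligned features the cycle always returns and the penalty is zero; in a trainable model this hard assignment would be replaced by soft attention so the penalty can be backpropagated.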
Pages: 3263-3277
Page count: 15