Cross-Modality Knowledge Calibration Network for Video Corpus Moment Retrieval

Cited: 0
Authors
Chen, Tongbao [1 ,2 ]
Wang, Wenmin [1 ]
Jiang, Zhe [1 ,3 ]
Li, Ruochen [1 ]
Wang, Bingshu [4 ]
Affiliations
[1] Macau Univ Sci & Technol, Sch Comp Sci & Engn, Taipa 999078, Macau, Peoples R China
[2] Guangdong Univ Technol, Sch Adv Mfg, Jieyang 515200, Peoples R China
[3] Guilin Coll Aerosp Technol, Sch Comp Sci & Engn, Guilin 541004, Peoples R China
[4] Northwestern Polytech Univ, Sch Software, Xian 710072, Peoples R China
Keywords
Visualization; Task analysis; Database languages; Semantics; Pipelines; Calibration; Transformers; Cross-modality; calibration; transformer; video corpus moment retrieval;
DOI
10.1109/TMM.2023.3316025
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Video corpus moment retrieval (VCMR) has recently become a hot topic; it aims to localize video moments highly relevant to a given natural-language query from a video corpus. Existing methods for this challenging task struggle when the visual and textual information in a video differ substantially from each other, or when redundant video content is semantically irrelevant to the query, which confuses the model when identifying the truly useful within- and cross-modality information. In this article, we propose a novel Cross-Modality Knowledge Calibration Network (CKCN) to address these issues. Specifically, a dual calibration transformer module with improved multi-head attention is proposed to simultaneously capture the within- and cross-modality features between the visual and textual modalities of the video while automatically compressing redundant information; a query-dependent fusion module is then designed to guide the fusion of the video's multi-modal features using prior knowledge from the query, further refining the more important modality features. Finally, a query-guided calibration transformer module with a well-designed learnable cell aligns the query and the video, forming a single joint representation for moment localization. Meanwhile, we introduce transfer learning into VCMR for the first time to address the shortage of labeled data. Extensive experiments on the widely used TVR and DiDeMo datasets achieve new state-of-the-art results, verifying the effectiveness of our proposed CKCN.
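The cross-modality interaction the abstract describes can be illustrated with a minimal sketch of scaled dot-product cross-attention, where textual query tokens attend over video clip features. This is an illustration of the general mechanism only, not the authors' CKCN implementation; the paper's dual calibration transformer uses an improved multi-head variant, and all dimensions and names below are assumptions.

```python
import numpy as np

def cross_modal_attention(text_feats, video_feats):
    """Single-head scaled dot-product cross-attention: text tokens act as
    queries, video clip features as keys and values, yielding query-aware
    video representations of shape (num_text_tokens, dim)."""
    d = text_feats.shape[-1]
    scores = text_feats @ video_feats.T / np.sqrt(d)       # (Lt, Lv) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ video_feats                           # attend over clips

rng = np.random.default_rng(0)
text = rng.standard_normal((12, 64))    # 12 query tokens, feature dim 64
video = rng.standard_normal((40, 64))   # 40 video clip features, dim 64
fused = cross_modal_attention(text, video)
print(fused.shape)  # (12, 64)
```

Each output row is a convex combination of video clip features weighted by their relevance to one query token, which is the basic operation underlying query-guided alignment of the two modalities.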
Pages: 3799-3813
Number of pages: 15