Dual-guided multi-modal bias removal strategy for temporal sentence grounding in video

Times Cited: 0
Authors
Ruan, Xiaowen [1 ]
Qi, Zhaobo [1 ]
Xu, Yuanrong [1 ]
Zhang, Weigang [1 ]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Weihai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal; Bias; Dual guidance; Generator;
DOI
10.1007/s00530-024-01587-3
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Temporal sentence grounding in video aims to locate the video segments that match an input textual query within a given video. This task is severely affected by dataset biases, which substantially degrade generalization performance. Existing debiasing methods, however, remove only a limited amount of bias and do not consider the more significant multimodal bias. In this work, we verify the existence of multimodal bias through comparative experiments and propose a dual-guided multi-modal bias removal strategy (DMBR) to address this issue. Built on the span-based natural language video localization paradigm, DMBR extracts salient textual concepts (such as verbs, nouns, and numerals) and visual concepts (such as actions contained in the input video) to guide the generation of multimodal biases: a language-guided multi-modal bias generator and a video-guided multi-modal bias generator simulate potential multimodal biases in a dual, complementary manner. We further introduce an adversarial training paradigm. The bias generators are expected to produce multi-modal bias samples that deceive both the discriminator and the backbone network, while the backbone network aims to make correct predictions even in the presence of biased features and the discriminator aims to predict accurately whether a sample contains bias. This strategy forces the backbone model to identify and remove the influence of multimodal biases, improving its robustness. We implement DMBR on multiple existing backbones and evaluate it on the widely used Charades-CD and ActivityNet-CD benchmarks, demonstrating the effectiveness of our debiasing strategy.
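The three-player adversarial setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the loss names, the additive form of each objective, and the weighting parameter `lambda_adv` are all assumptions made for illustration.

```python
def adversarial_objectives(backbone_loss_clean, backbone_loss_biased,
                           disc_loss_real, disc_loss_biased,
                           lambda_adv=0.5):
    """Toy sketch of DMBR-style adversarial objectives (assumed form).

    - Backbone: predict correctly on clean AND bias-injected samples,
      so its objective sums both localization losses.
    - Discriminator: detect whether a sample carries generated bias,
      so it minimizes its error on both real and biased samples.
    - Bias generators: fool both players, so their objective is the
      negation of the discriminator + backbone losses on biased samples.
    """
    backbone_objective = backbone_loss_clean + lambda_adv * backbone_loss_biased
    discriminator_objective = disc_loss_real + disc_loss_biased
    generator_objective = -(disc_loss_biased + backbone_loss_biased)
    return backbone_objective, discriminator_objective, generator_objective
```

In an actual training loop each objective would be minimized with respect to its own player's parameters, alternating updates as in standard adversarial training.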
Pages: 15