DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Cited by: 13
Authors
Jin, Peng [1,3]
Li, Hao [1,3]
Cheng, Zesen [1,3]
Li, Kehan [1,3]
Ji, Xiangyang [4]
Liu, Chang [4]
Yuan, Li [1,2,3]
Chen, Jie [1,2,3]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Peking Univ, Shenzhen Grad Sch, AI Sci AI4S Preferred Program, Shenzhen, Peoples R China
[4] Tsinghua Univ, Dept Automat & BNRist, Beijing, Peoples R China
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023
Funding
National Key R&D Program of China;
Keywords
DOI
10.1109/ICCV51070.2023.00234
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator being optimized by generation loss and the feature extractor trained with contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, with superior performances, justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code is available at https://github.com/jpthu17/DiffusionRet.
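The abstract's two-branch training, a generation loss for the denoising generator and a contrastive loss for the feature extractor, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that): the TinyDenoiser module, the cosine-like noise schedule, the equal weighting of the two losses, and all tensor shapes below are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, video_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired text/video embeddings
    # (discriminative objective for the feature extractor).
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class TinyDenoiser(torch.nn.Module):
    # Toy conditional denoiser: predicts the noise added to a (B, N) relevance
    # vector, conditioned on the query embedding and the diffusion timestep.
    def __init__(self, num_candidates, emb_dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_candidates + emb_dim + 1, 256),
            torch.nn.GELU(),
            torch.nn.Linear(256, num_candidates),
        )
    def forward(self, noisy, t, text_emb):
        cond = torch.cat([noisy, text_emb, t.float().unsqueeze(-1)], dim=-1)
        return self.net(cond)

def generation_loss(denoiser, clean_dist, text_emb, num_steps=50):
    # Diffusion-style objective: corrupt the target relevance distribution with
    # Gaussian noise at a random timestep, then train the denoiser to predict
    # that noise (standard epsilon-prediction form; schedule is an assumption).
    t = torch.randint(0, num_steps, (clean_dist.size(0),), device=clean_dist.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2
    noise = torch.randn_like(clean_dist)
    noisy = alpha_bar.sqrt().unsqueeze(-1) * clean_dist + (1 - alpha_bar).sqrt().unsqueeze(-1) * noise
    return F.mse_loss(denoiser(noisy, t, text_emb), noise)

# Toy usage with random tensors standing in for real text/video features.
B, N, D = 8, 8, 512
text_emb, video_emb = torch.randn(B, D), torch.randn(B, D)
denoiser = TinyDenoiser(num_candidates=N, emb_dim=D)
target = torch.eye(N)  # one-hot ground-truth relevance per query
loss = contrastive_loss(text_emb, video_emb) + generation_loss(denoiser, target, text_emb)
loss.backward()

As in the abstract, the sketch optimizes the two perspectives jointly by summing the generation and contrastive terms; how they are balanced in the actual method is not specified here and the unit weighting above is an assumption.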
Pages: 2470-2481
Number of pages: 12