54 entries in total
[31] Lin Yan-Bo, 2022, ARXIV220402874, P8
[32] 12-in-1: Multi-Task Vision and Language Representation Learning [J]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020: 10434-10443
[34] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [J]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 2630-2640
[35] Mnih V, 2014, ADV NEUR IN, V27
[36] Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering [J]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021: 15521-15530
[37] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [J]. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 2065-2074
[38] Radford A, 2021, PR MACH LEARN RES, V139
[39] Sharma P, 2018, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Vol 1, P2556
[40] MovieQA: Understanding Stories in Movies through Question-Answering [J]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 4631-4640