130 entries in total
[1]
MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering [J]. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023: 4659-4664
[2]
Alicioglu G., 2021, Comput. Graph., V102
[3]
Anwer RM, 2023, arXiv, DOI 10.48550/ARXIV.2307.13721
[4]
ViViT: A Video Vision Transformer [J]. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 6816-6826
[5]
Bai ZY, 2023, ADV NEUR IN
[6]
Baumli K, 2024, arXiv, DOI arXiv:2312.09187
[7]
Chen D., 2021, Keyword-aware multi-modal enhancement attention for video question answering, P128, DOI 10.1145/3507548.3507567
[9]
Choi M, 2024, arXiv, DOI arXiv:2403.11021
[10]
Colas A., 2019, TutorialVQA: Question answering dataset for tutorial videos