41 entries in total
[31]
Connecting Modalities: Semi-supervised Segmentation and Annotation of Images Using Unaligned Text Corpora [J]. 2010 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2010: 966-973
[32]
Su WJ, 2020, arXiv, arXiv:1908.08530, DOI 10.48550/ARXIV.1908.08530
[33]
Suhr A, 2019, arXiv, arXiv:1811.00491
[34]
VideoBERT: A Joint Model for Video and Language Representation Learning [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 7463-7472
[35]
Tan H, 2019, 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019), P5100
[36]
van der Maaten L, 2008, J MACH LEARN RES, V9, P2579
[37]
Vaswani A, 2017, ADV NEUR IN, V30
[38]
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016: 4622-4630
[39]
Image Captioning with Semantic Attention [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016: 4651-4659
[40]
Young P., 2014, Transactions of the Association for Computational Linguistics, V2, P67, DOI 10.1162/TACL_A_00166