共 122 条
[1]
[Anonymous], 2015, ICLR
[2]
Bahdanau D, 2016, Arxiv, DOI [arXiv:1409.0473, 10.48550/arXiv.1409.0473,1409.0473, DOI 10.48550/ARXIV.1409.0473,1409.0473]
[3]
G3RAPHGROUND: Graph-based Language Grounding
[J].
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019),
2019,
:4280-4289
[4]
Beltagy I, 2020, Arxiv, DOI arXiv:2004.05150
[5]
End-to-End Referring Video Object Segmentation with Multimodal Transformers
[J].
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022),
2022,
:4975-4985
[6]
One-Shot Video Object Segmentation
[J].
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017),
2017,
:5320-5329
[7]
End-to-End Object Detection with Transformers
[J].
COMPUTER VISION - ECCV 2020, PT I,
2020, 12346
:213-229
[8]
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
[J].
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017),
2017,
:4724-4733
[9]
UNITER: UNiversal Image-TExt Representation Learning
[J].
COMPUTER VISION - ECCV 2020, PT XXX,
2020, 12375
:104-120
[10]
Child R, 2019, Arxiv, DOI [arXiv:1904.10509, 10.48550/arXiv.1904.10509]