36 entries in total
- [21] Matsubara T., 2019, ARXIV191006514
- [22] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015: 2641-2649
- [23] Radford L, 2018, ICME-13 MONOGR, P3, DOI 10.1007/978-3-319-68351-5_1
- [25] Schuster S., 2015, P 4 WORKSH VIS LANG, P70
- [26] Simonyan K, 2015, arXiv:1409.1556, DOI 10.48550/ARXIV.1409.1556
- [27] Vaswani A., 2017, Advances in neural information processing systems, P6000, DOI 10.48550/ARXIV.1706.03762
- [28] Velickovic Petar, 2017, STAT, P1, DOI 10.48550/ARXIV.1710.10903
- [29] Adversarial Cross-Modal Retrieval [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017: 154-162
- [30] Wang S., 2019, ARXIV191005134