Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining

Cited by: 0
Authors
Nebbia, Giacomo [1 ]
Kovashka, Adriana [1 ]
Affiliations
[1] Univ Pittsburgh, Pittsburgh, PA USA
Source
PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023 | 2023
Funding
National Science Foundation (US);
Keywords
grounding; hypernymization; named entities; open-vocabulary detection;
DOI
10.1145/3591106.3592223
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Named entities are ubiquitous in text that naturally accompanies images, especially in domains such as news or Wikipedia articles. Prior work has identified named entities as a likely cause of the low performance of image-text retrieval models pretrained on Wikipedia and evaluated on named entity-free benchmark datasets. Because each named entity is rarely mentioned, such entities can be challenging to model. They also represent missed learning opportunities for self-supervised models: the link between a named entity and an object in the image may be missed by the model, whereas it would not be if the object were mentioned using a more common term. In this work, we investigate hypernymization as a way to handle named entities when pretraining grounding-based multi-modal models and when fine-tuning on open-vocabulary detection. We propose two ways to perform hypernymization: (1) a "manual" pipeline relying on a comprehensive ontology of concepts, and (2) a "learned" approach in which we train a language model to perform hypernymization. We run experiments on data from Wikipedia and The New York Times. We report improved pretraining performance on objects of interest following hypernymization, and we show the promise of hypernymization for open-vocabulary detection, specifically on classes not seen during training.
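The "manual" hypernymization the abstract describes can be sketched in miniature as a lookup that replaces each named entity in a caption with a more common hypernym. The mapping below is a toy, invented for illustration; the paper's actual pipeline relies on a comprehensive ontology of concepts rather than a hand-built dictionary:

```python
# Toy illustration of ontology-based hypernymization: each named entity
# is mapped to a common hypernym term. The entries here are invented
# examples, not the paper's actual ontology.
TOY_ONTOLOGY = {
    "Golden Gate Bridge": "bridge",
    "Boeing 747": "airplane",
    "Serena Williams": "person",
}

def hypernymize(caption: str, ontology: dict) -> str:
    """Replace every known named entity in the caption with its hypernym."""
    for entity, hypernym in ontology.items():
        caption = caption.replace(entity, hypernym)
    return caption

print(hypernymize("Serena Williams stands near the Golden Gate Bridge.", TOY_ONTOLOGY))
# → "person stands near the bridge."
```

The intuition is that a self-supervised grounding model sees "bridge" far more often than "Golden Gate Bridge", so the rewritten caption gives it a learnable link between the word and the object in the image.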
Pages: 67-75
Page count: 9