Fine-grained Multimodal Entity Linking for Videos

Cited by: 0
Authors
Zhao H.-Q. [1,2]
Wang X.-W. [1,2]
Li J.-L. [3]
Li Z.-X. [1,2]
Xiao Y.-H. [1,2]
Affiliations
[1] School of Computer Science, Fudan University, Shanghai
[2] Shanghai Key Laboratory of Data Science, Fudan University, Shanghai
[3] School of Computer Science and Technology, Soochow University, Suzhou
Source
Ruan Jian Xue Bao/Journal of Software | 2024, Vol. 35, No. 3
Keywords
contrastive learning; dataset; fine-grained; large language model; video entity linking;
DOI
10.13328/j.cnki.jos.007078
CLC number
TP3 [Computing technology and computer technology]
Subject classification code
0812
Abstract
With the rapid development of the Internet and big data, both the scale and the variety of data are growing. Video, as an important carrier of information, is becoming increasingly prevalent, particularly with the recent rise of short videos, and understanding and analyzing large-scale video collections has become a hot research topic. Entity linking, as a way of enriching background knowledge, can provide a wealth of external information, and entity linking in videos can effectively assist in understanding video content, enabling classification, retrieval, and recommendation. However, the granularity of existing video entity linking datasets and methods is too coarse. Therefore, this study proposes a fine-grained entity linking approach for videos, focusing on live-streaming scenarios, and constructs a fine-grained video entity linking dataset. In addition, to address the challenges of the fine-grained video linking task, this study proposes using large language models to extract entities and their attributes from videos, and employs contrastive learning to obtain better representations of videos and their corresponding entities. The results demonstrate that the proposed method can effectively handle fine-grained entity linking in videos. © 2024 Chinese Academy of Sciences. All rights reserved.
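The abstract names contrastive learning as the mechanism for aligning video representations with the representations of their linked entities. The following is a minimal, illustrative PyTorch sketch of a symmetric InfoNCE-style objective of that kind; the function name, embedding shapes, and temperature value are assumptions made for illustration and are not taken from the paper.

# Illustrative sketch (not the authors' implementation): a symmetric
# InfoNCE-style contrastive loss that pulls each video embedding toward the
# embedding of its linked entity and pushes it away from other entities in
# the same batch.
import torch
import torch.nn.functional as F

def video_entity_contrastive_loss(video_emb: torch.Tensor,
                                  entity_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """video_emb, entity_emb: (batch, dim); row i of each forms a matched pair."""
    v = F.normalize(video_emb, dim=-1)
    e = F.normalize(entity_emb, dim=-1)
    logits = v @ e.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2e = F.cross_entropy(logits, targets)       # video -> entity direction
    loss_e2v = F.cross_entropy(logits.t(), targets)   # entity -> video direction
    return 0.5 * (loss_v2e + loss_e2v)

if __name__ == "__main__":
    # Random tensors stand in for encoder outputs (e.g. pooled frame features
    # and encoded entity names plus attributes).
    videos = torch.randn(8, 512)
    entities = torch.randn(8, 512)
    print(video_entity_contrastive_loss(videos, entities).item())

The symmetric form optimizes both retrieval directions (video-to-entity and entity-to-video), which is the usual choice when both modalities need well-aligned representations.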
Pages: 1140-1153
Page count: 13