Dense video captioning using unsupervised semantic information

Cited by: 0
Authors
Estevam, Valter [1, 2]
Laroca, Rayson [2, 3]
Pedrini, Helio [4]
Menotti, David [2]
Affiliations
[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil
[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil
[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil
[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil
Keywords
Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention
DOI
10.1016/j.jvcir.2024.104385
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group representations, producing a visual codebook. We then learn a dense representation by encoding the co-occurrence probability matrix of the codebook entries. This representation improves performance on the dense video captioning task when only visual features are available. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask among methods that use only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
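The first two steps of the abstract (clustering into a codebook, then estimating co-occurrence probabilities between codebook entries) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering method or the co-occurrence window, so the k-means routine, the farthest-point initialization, the `window` parameter, and the toy 2-D "clip features" below are all assumptions made for illustration. The final step, learning a dense representation from the matrix, is omitted.

```python
from math import dist

def init_farthest(points, k):
    """Deterministic seeding (an assumption): start at the first point,
    then repeatedly add the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return [list(c) for c in centroids]

def assign(p, centroids):
    """Index of the nearest centroid, i.e. the codebook entry for p."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, standing in for the unspecified clustering method."""
    centroids = init_farthest(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(c) / len(g) for c in zip(*g)]
    return centroids

def cooccurrence_probability(sequences, k, window=1):
    """Row-normalized counts: P[i][j] estimates how often code j occurs
    within `window` steps of code i in the same video."""
    counts = [[0.0] * k for _ in range(k)]
    for seq in sequences:
        for t, a in enumerate(seq):
            neighbors = seq[max(0, t - window):t] + seq[t + 1:t + 1 + window]
            for b in neighbors:
                counts[a][b] += 1.0
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs

# Toy "videos": each is a sequence of 2-D clip features (made-up numbers).
videos = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)],
    [(5.0, 5.0), (0.0, 0.0), (9.9, 0.1), (10.0, 0.0)],
]
all_clips = [clip for video in videos for clip in video]
codebook = kmeans(all_clips, k=3)
code_sequences = [[assign(clip, codebook) for clip in video] for video in videos]
P = cooccurrence_probability(code_sequences, k=3)

for row in P:
    print([round(x, 3) for x in row])
```

Each row of `P` is a conditional distribution over codebook entries. In this toy setup the "simple events" are the three clusters, and both videos share some of them, which mirrors the premise that simple events recur across complex ones.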
Pages: 10
Related papers
50 in total
  • [31] UnDER: Unsupervised Dense Point Cloud Extraction Routine for UAV Imagery Using Deep Learning
    Bergado, John Ray
    Nex, Francesco
    REMOTE SENSING, 2025, 17 (01)
  • [32] Generalized unsupervised functional map learning for dense correspondence
    Han, Li
    Shi, Xue
    He, Jinhai
    Ma, Huiwen
    Dou, Feng
    Zhao, Hongkai
    VISUAL COMPUTER, 2023, 39 (12) : 6625 - 6638
  • [33] Generalized unsupervised functional map learning for dense correspondence
    Li Han
    Xue Shi
    Jinhai He
    Huiwen Ma
    Feng Dou
    Hongkai Zhao
    The Visual Computer, 2023, 39 : 6625 - 6638
  • [34] Unsupervised mining of statistical temporal structures in video
    Xie, LX
    Chang, SF
    Divakaran, A
    Sun, HF
    VIDEO MINING, 2003, 6 : 279 - 307
  • [35] Unsupervised Learning of Functional Categories in Video Scenes
    Turek, Matthew W.
    Hoogs, Anthony
    Collins, Roderic
    COMPUTER VISION-ECCV 2010, PT II, 2010, 6312 : 664 - 677
  • [36] Benchmarking Unsupervised Object Representations for Video Sequences
    Weis, Marissa A.
    Chitta, Kashyap
    Sharma, Yash
    Brendel, Wieland
    Bethge, Matthias
    Geiger, Andreas
    Ecker, Alexander S.
    JOURNAL OF MACHINE LEARNING RESEARCH, 2021, 22
  • [37] Extracting Semantic Knowledge From GANs With Unsupervised Learning
    Xu, Jianjin
    Zhang, Zhaoxiang
    Hu, Xiaolin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (08) : 9654 - 9668
  • [38] USMART: An Unsupervised Semantic Mining Activity Recognition Technique
    Ye, Juan
    Stevenson, Graeme
    Dobson, Simon
    ACM TRANSACTIONS ON INTERACTIVE INTELLIGENT SYSTEMS, 2015, 4 (04)
  • [39] UESTS: An Unsupervised Ensemble Semantic Textual Similarity Method
    Hassan, Basma
    Abdelrahman, Samir E.
    Bahgat, Reem
    Farag, Ibrahim
    IEEE ACCESS, 2019, 7 : 85462 - 85482
  • [40] Unsupervised Semantic Mapping for Healthcare Data Storage Schema
    Satti, Fahad Ahmed
    Hussain, Musarrat
    Hussain, Jamil
    Ali, Syed Imran
    Ali, Taqdir
    Bilal, Hafiz Syed Muhammad
    Chung, Taechoong
    Lee, Sungyoung
    IEEE ACCESS, 2021, 9 : 107267 - 107278