Dense video captioning using unsupervised semantic information

Cited by: 0

Authors:

Estevam, Valter [1,2]

Laroca, Rayson [2,3]

Pedrini, Helio [4]

Menotti, David [2]

Affiliations:

[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil

[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil

[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil

[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil

Source:

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION | 2025 / Volume 107

Keywords:

Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention

DOI:

10.1016/j.jvcir.2024.104385

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves performance on the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask compared to the methods that explore only visual features, as well as a competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
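
The two steps named in the abstract (clustering into a visual codebook, then encoding codeword co-occurrence) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked above. Everything here is an assumption made for illustration: the function names (`kmeans`, `cooccurrence_probabilities`, `dense_codeword_embeddings`), the random stand-in features, the temporal window of 2 clips, and in particular the use of a truncated SVD as the encoder of the co-occurrence matrix, which is swapped in only because it is simple.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every feature vector to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)


def cooccurrence_probabilities(videos_labels, k, window=2):
    """Count codeword pairs that occur within `window` clips of each other
    in the same video, then normalize the counts to sum to 1."""
    counts = np.zeros((k, k))
    for labels in videos_labels:
        for i, a in enumerate(labels):
            for b in labels[i + 1 : i + 1 + window]:
                counts[a, b] += 1
                counts[b, a] += 1  # keep the matrix symmetric
    total = counts.sum()
    return counts / total if total > 0 else counts


def dense_codeword_embeddings(probs, dim):
    """Assumed stand-in encoder: truncated SVD of the probability matrix."""
    u, s, _ = np.linalg.svd(probs)
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical data: 3 videos, 40 clips each, 16-dimensional clip features.
rng = np.random.default_rng(42)
num_videos, clips_per_video, feat_dim, k = 3, 40, 16, 8
features = rng.normal(size=(num_videos * clips_per_video, feat_dim))

codebook, labels = kmeans(features, k)               # visual codebook
per_video = labels.reshape(num_videos, clips_per_video)
probs = cooccurrence_probabilities(per_video, k)     # co-occurrence matrix
embeddings = dense_codeword_embeddings(probs, dim=4) # one vector per codeword
descriptors = embeddings[labels]                     # one vector per clip
```

In this sketch, `descriptors` holds one dense vector per clip. The abstract describes concatenating such a descriptor with the visual representation before captioning; how that is done in practice is not specified in this record.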

Pages: 10
