Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Cited by: 0
Authors
Liu, Daizong [1 ,2 ]
Qu, Xiaoye [2 ]
Wang, Yinzhen [3 ]
Di, Xing [4 ]
Zou, Kai [4 ]
Cheng, Yu [5 ]
Xu, Zichuan [6 ]
Zhou, Pan [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Hubei Engn Res Ctr Big Data Secur, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Hubei, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Hubei, Peoples R China
[4] ProtagoLabs Inc, Vienna, VA USA
[5] Microsoft Res, Redmond, WA USA
[6] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although existing works have achieved decent results on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages the semantic information of the entire query set to compose the possible activities in each video for grounding. Specifically, we first develop a language semantic mining module that extracts implicit semantic features from the whole query set. These language semantic features then serve as guidance for composing the activity in the video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out redundant background activities and refine the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance and even outperforms most weakly-supervised approaches.
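As a rough illustration of the pipeline the abstract describes, the Python sketch below clusters sentence embeddings from the whole query set (language semantic mining), scores video frames against the mined semantic centers (video-based semantic aggregation), and suppresses low-scoring frames (foreground attention). It is a minimal toy under stated assumptions, not the authors' implementation: the function names, the use of k-means, the cosine-similarity scoring, and the 0.5 threshold are all illustrative choices; DSCNet realizes these steps with learned deep modules.

import numpy as np
from sklearn.cluster import KMeans


def mine_language_semantics(query_embeddings, n_clusters=8):
    # Cluster sentence embeddings from the WHOLE query set (no video-query pairing)
    # and return the centroids as implicit semantic features.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(query_embeddings)
    return kmeans.cluster_centers_            # shape: (n_clusters, dim)


def aggregate_video_semantics(frame_features, semantic_centers):
    # Score each frame against each mined semantic center (cosine similarity),
    # so that language semantics guide which frames compose an activity.
    f = frame_features / (np.linalg.norm(frame_features, axis=1, keepdims=True) + 1e-8)
    c = semantic_centers / (np.linalg.norm(semantic_centers, axis=1, keepdims=True) + 1e-8)
    return f @ c.T                            # shape: (n_frames, n_clusters)


def foreground_attention(affinity, threshold=0.5):
    # Collapse the affinity map into a per-frame foreground score and suppress
    # low-scoring (background) frames; the threshold is an arbitrary choice here.
    scores = affinity.max(axis=1)
    scores = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    return np.where(scores > threshold, scores, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(200, 512))     # embeddings of every sentence query in the corpus
    frames = rng.normal(size=(64, 512))       # clip-level features of one video
    centers = mine_language_semantics(queries)
    affinity = aggregate_video_semantics(frames, centers)
    fg = foreground_attention(affinity)
    print("candidate foreground frames:", np.flatnonzero(fg))

The sketch only mirrors the data flow (query-set mining, then semantic-guided aggregation, then foreground filtering); the grounded segment would be read off the surviving contiguous foreground frames.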
Pages: 1683-1691
Page count: 9