Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Cited by: 0
Authors
Liu, Daizong [1, 2]
Qu, Xiaoye [2]
Wang, Yinzhen [3]
Di, Xing [4]
Zou, Kai [4]
Cheng, Yu [5]
Xu, Zichuan [6]
Zhou, Pan [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Hubei Engn Res Ctr Big Data Secur, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Hubei, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Hubei, Peoples R China
[4] ProtagoLabs Inc, Vienna, Austria
[5] Microsoft Res, Redmond, WA USA
[6] Dalian Univ Technol, Dalian, Peoples R China
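Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract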
Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work trying to address TVG in an unsupervised setting. Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as the guidance to compose the activity in video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out the redundant background activities and refine the grounding results. To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.
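The pipeline outlined in the abstract (mine semantics from the whole query set, use them to score video content, then keep the foreground) can be illustrated with a toy sketch. This is not the DSCNet implementation: plain k-means stands in for the language semantic mining module, a maximum cosine similarity to the cluster centers stands in for semantic aggregation and foreground attention, and the function names, feature vectors, and threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(points, k, iters=10):
    """Toy k-means using cosine similarity; a stand-in for query-set semantic mining."""
    centers = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda j: cosine(p, centers[j]))
            groups[best].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def frame_scores(frames, centers):
    """Score each frame by its best match to any mined semantic center."""
    return [max(cosine(f, c) for c in centers) for f in frames]


def longest_segment(scores, threshold):
    """Return (start, end) of the longest run of scores >= threshold, or None."""
    best, start = None, None
    for i, s in enumerate(scores + [float("-inf")]):  # sentinel closes a trailing run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


# Hypothetical 3-D features: no query is ever paired with a video.
queries = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0]]
frames = [[0, 0, 1], [0, 0, 1], [1, 0, 0.1], [0.9, 0.1, 0], [1, 0, 0], [0, 0, 1]]

centers = kmeans(queries, k=2)
segment = longest_segment(frame_scores(frames, centers), threshold=0.5)
print(segment)
```

The point of the sketch is only that frame scoring uses statistics of the whole query set rather than any video-query pair; the actual model learns these components end to end.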
Pages: 1683-1691
Number of pages: 9