Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Cited: 0
Authors
Liu, Daizong [1 ,2 ]
Qu, Xiaoye [2 ]
Wang, Yinzhen [3 ]
Di, Xing [4 ]
Zou, Kai [4 ]
Cheng, Yu [5 ]
Xu, Zichuan [6 ]
Zhou, Pan [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Hubei Engn Res Ctr Big Data Secur, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Hubei, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Hubei, Peoples R China
[4] ProtagoLabs Inc, Vienna, VA USA
[5] Microsoft Res, Redmond, WA USA
[6] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although previous works have achieved decent results on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages the semantic information of the whole query set to compose the possible activities in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. These language semantic features then serve as guidance for composing the activities in each video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out redundant background activities and refine the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, even outperforming most weakly-supervised approaches.
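The abstract outlines a three-stage pipeline: language semantic mining over the whole query set, video-based semantic aggregation, and a foreground attention branch. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; the class name DSCNetSketch, all dimensions, and the soft-clustering mechanism are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DSCNetSketch(nn.Module):
    """Toy sketch of the three stages named in the abstract. All module
    names, shapes, and the soft-clustering step are assumptions for
    illustration only."""

    def __init__(self, dim=256, num_clusters=16):
        super().__init__()
        # Language semantic mining: learnable cluster centers meant to
        # capture implicit semantics shared across the whole query set.
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        # Video-based semantic aggregation: cross-attention from video
        # frames to the mined language semantics.
        self.aggregate = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Foreground attention branch: per-frame score used to suppress
        # background activities and refine the grounding.
        self.foreground = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def mine_semantics(self, query_feats):
        # Soft-assign every query feature (Q, dim) to the cluster centers
        # and return the soft cluster means as mined semantic features.
        assign = F.softmax(query_feats @ self.centers.t(), dim=-1)   # (Q, K)
        weights = assign.t()                                          # (K, Q)
        weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-6)
        return weights @ query_feats                                  # (K, dim)

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, dim) frame features; query_feats: (Q, dim)
        # sentence features pooled from the entire (unpaired) query set.
        semantics = self.mine_semantics(query_feats)                  # (K, dim)
        sem = semantics.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        composed, _ = self.aggregate(video_feats, sem, sem)           # (B, T, dim)
        scores = self.foreground(composed).squeeze(-1)                # (B, T)
        return scores  # contiguous high-score frames give the segment

# Example: 2 videos of 64 frames, a pool of 100 unpaired queries.
model = DSCNetSketch()
scores = model(torch.randn(2, 64, 256), torch.randn(100, 256))
print(scores.shape)  # torch.Size([2, 64])

In the actual method the mined semantics would be trained jointly with the grounding objectives; here the soft cluster means merely stand in for the query-set semantics that guide activity composition.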
Pages: 1683-1691
Page count: 9