Who, What, and Where: Composite-Semantics Instance Search for Story Videos

Cited by: 0
Authors
Guo, Jiahao [1 ,2 ]
Lu, Ankang [1 ,2 ]
Wu, Zhengqian [1 ,2 ]
Wang, Zhongyuan [1 ,2 ]
Liang, Chao [1 ,2 ]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software NERCMS, Hubei Key Lab Multimedia & Network Commun Engn, Wuhan 430072, Peoples R China
[2] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Videos; Semantics; Correlation; TV; Feature extraction; Chaos; Training; Support vector machines; Search problems; NIST; Who-what-where; instance search; video structure aware; partial decomposition;
DOI
10.1109/TIP.2025.3542272
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Who, What and Where (3W) are the three core elements of storytelling, and accurately identifying the 3W semantics is critical to understanding the story in a video. This paper studies the 3W composite-semantics video Instance Search (INS) problem, which aims to find video shots about a specific person doing a concrete action in a particular location. The popular Complete-Decomposition (CD) methods divide a composite-semantics query into multiple single-semantics queries, which are likely to yield inaccurate or incomplete retrieval results because they neglect important semantic correlations. Recent Non-Decomposition (ND) methods use a Vision Language Model (VLM) to directly measure the similarity between the textual query and the video content; however, their accuracy is limited by the VLM's immature capability to recognize fine-grained objects. To address the above challenges, we propose a video structure-aware Partial-Decomposition (PD) method. Its core idea is to partially decompose the 3W INS problem into three semantically correlated 2W INS problems, i.e., person-action INS, action-location INS, and location-person INS. Thereafter, we model the correlations between pairs of semantics at the frame, shot, and scene levels of story videos. With the help of the spatial consistency and temporal continuity contained in the unique hierarchical structure of story videos, we can finally obtain identity-matching, logic-consistent, and content-coherent 3W INS results. To validate the effectiveness of the proposed method, we build three large-scale 3W INS datasets based on three TV series, Eastenders, Friends, and The Big Bang Theory, comprising in total over 670K video shots spanning 700 hours. Extensive experiments show that the proposed PD method surpasses the current state-of-the-art CD and ND methods for 3W INS in story videos.
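The partial-decomposition idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the pairwise scoring functions and the simple additive fusion below are hypothetical placeholders standing in for the paper's frame/shot/scene-level correlation models; only the decomposition of one 3W query into three pairwise 2W queries follows the abstract.

```python
# Illustrative sketch of Partial Decomposition (PD): a 3W query
# (person, action, location) is split into three 2W sub-queries whose
# per-shot scores are fused. Scores here are precomputed toy values;
# in the paper they would come from learned pairwise correlation models.

def score_2w(shot, key):
    # Hypothetical pairwise scorer: looks up a precomputed 2W score.
    return shot.get(key, 0.0)

def rank_3w(shots, person, action, location):
    """Rank shots by fusing the three 2W scores (simple sum as placeholder)."""
    scored = []
    for shot_id, shot in shots.items():
        s = (score_2w(shot, ("person-action", person, action))
             + score_2w(shot, ("action-location", action, location))
             + score_2w(shot, ("location-person", location, person)))
        scored.append((shot_id, s))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy example: two shots with made-up pairwise scores.
shots = {
    "shot_001": {("person-action", "Joey", "eating"): 0.9,
                 ("action-location", "eating", "kitchen"): 0.8,
                 ("location-person", "kitchen", "Joey"): 0.7},
    "shot_002": {("person-action", "Joey", "eating"): 0.2,
                 ("action-location", "eating", "kitchen"): 0.3,
                 ("location-person", "kitchen", "Joey"): 0.1},
}
ranking = rank_3w(shots, "Joey", "eating", "kitchen")
```

A complete-decomposition baseline would instead score "Joey", "eating", and "kitchen" independently, losing the pairwise correlations that the 2W sub-queries retain.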
Pages: 1412-1426
Page count: 15
Related papers
1 record
  • [1] Who, What and Where: Composite-semantic Instance Search for Story Videos
    Guo, Jiahao
    Liang, Chao
    Wang, Zhongyuan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 858 - 863