Problematic Unordered Queries in Temporal Moment Measurement by Using Natural Language

Cited by: 0
Authors
Nawaz, Hafiza Sadia [1 ]
Dong, Junyu [1 ]
Affiliations
[1] Ocean Univ China, Dept Comp Sci & Informat, Qingdao 266104, Shandong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; query processing; natural language processing; grammar; feature extraction; task analysis; location awareness; moment measurement via language query in video; temporal moment localization using natural language; cross-modal interactions; single moment retrieval; localization; network
DOI
10.1109/ACCESS.2023.3264443
CLC number
TP [automation technology, computer technology]
Discipline code
0812
Abstract
This study examines the problem of temporal moment measurement by using natural language (TMMNL) in untrimmed videos. The purpose of TMMNL is to use a natural language query to find a specific moment within a lengthy video. The task is challenging because other, closely related activities may divert attention from the target temporal moment. Existing research has addressed this issue with computer vision techniques such as reinforcement-, anchor-, and ranking-based methods. In this research, we not only propose a TMMNL solution that locates the required moment from a natural language query, but also identify a novel issue: if the given query is unordered (lacking a proper subject, verb, and object), the system has trouble understanding it and the network may perform poorly. Previous methods degrade on unordered queries and fail to retrieve the relevant moment, which lowers overall performance. To address the unordered-query problem in TMMNL, we introduce the novel concept of the-visual, the-action, the-object, and the-connecting words. Our proposed network, Graph Convolutions with Latent variable for Visual-Textual Network (GCL-VTN), has three components: 1) visual graph convolution (visual GC); 2) textual graph convolution (textual GC); and 3) a compatible method for learning embeddings (CMLE). Visual nodes in the visual GC capture regional attributes, object, and actor information, while textual nodes in the textual GC maintain the word sequence using grammar-based query rules. The CMLE integrates the different modalities (moment, query) and the trained grammar-based words into the same embedding space. To align and preserve the query sequence, we also incorporate a stochastic latent variable in the CMLE with prior and posterior distributions. The posterior distribution handles both visual and textual data and applies when the query is in the correct sequence, i.e., follows grammar rules; the prior distribution handles textual data alone and applies when the query is unordered. Experiments on the TACoS, Charades-STA, and ActivityNet-Captions benchmarks show that our GCL-VTN outperforms state-of-the-art methods.
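To make the prior/posterior mechanism in the abstract concrete, the sketch below shows one way to implement a stochastic latent variable whose posterior is conditioned on both the moment and the query, while the prior is conditioned on the query alone, so that an unordered query can still be handled from textual evidence. This is a minimal illustration assuming a PyTorch-style implementation; the module name LatentAligner, the dimensions, and the KL-based training signal are hypothetical and are not taken from the paper.

import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class LatentAligner(nn.Module):
    """Hypothetical sketch of a CMLE-style stochastic latent variable."""
    def __init__(self, dim=256):
        super().__init__()
        # Posterior q(z | moment, query): uses both visual and textual features.
        self.posterior_net = nn.Linear(dim * 2, dim * 2)
        # Prior p(z | query): uses the (possibly unordered) query features only.
        self.prior_net = nn.Linear(dim, dim * 2)

    def forward(self, moment_feat, query_feat):
        # Each head predicts a mean and a log-variance of a diagonal Gaussian.
        mu_q, logvar_q = self.posterior_net(
            torch.cat([moment_feat, query_feat], dim=-1)).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior_net(query_feat).chunk(2, dim=-1)
        q = Normal(mu_q, logvar_q.mul(0.5).exp())
        p = Normal(mu_p, logvar_p.mul(0.5).exp())
        z = q.rsample()                    # reparameterized sample for training
        kl = kl_divergence(q, p).sum(-1)   # pulls the query-only prior toward
                                           # the moment-aware posterior
        return z, kl

# Usage sketch: during training, sample z from the posterior and add the KL
# term to the loss; at test time on an unordered query, draw z from the prior.
aligner = LatentAligner(dim=256)
moment = torch.randn(4, 256)   # batch of moment embeddings (hypothetical)
query = torch.randn(4, 256)    # batch of query embeddings (hypothetical)
z, kl = aligner(moment, query)

Minimizing the KL term during training is what would let the query-only prior approximate the moment-aware posterior, which matches the abstract's claim that the prior alone remains effective when the query does not follow grammar rules.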
Pages: 37976-37986
Page count: 11