Bilingual video captioning model for enhanced video retrieval

Cited by: 1
Authors
Alrebdi, Norah [1 ]
Al-Shargabi, Amal A. [1 ]
Affiliation
[1] Qassim Univ, Coll Comp, Dept Informat Technol, Buraydah 51452, Saudi Arabia
Keywords
Artificial intelligence; Computer vision; Natural language processing; Video retrieval; English video captioning; Arabic video captioning; LANGUAGE; NETWORK; VISION; TEXT;
DOI
10.1186/s40537-024-00878-w
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Many video platforms rely on the descriptions that uploaders provide for video retrieval. However, this reliance may cause inaccuracies. Although deep learning-based video captioning can resolve this problem, it has some limitations: (1) traditional keyframe extraction techniques do not consider video length or content, resulting in low accuracy, high storage requirements, and long processing times; (2) Arabic language support in video captioning is not extensive. This study proposes a new video captioning approach that uses an efficient keyframe extraction method and supports both Arabic and English. The proposed keyframe extraction technique combines time- and content-based approaches for better-quality captions, lower storage requirements, and faster processing. The English and Arabic models use a sequence-to-sequence framework with long short-term memory (LSTM) in both the encoder and the decoder. Both models were evaluated on caption quality using four metrics: bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ORdering (METEOR), recall-oriented understudy for gisting evaluation (ROUGE-L), and consensus-based image description evaluation (CIDEr). They were also evaluated using cosine similarity to determine their suitability for video retrieval. The results demonstrated that the English model performed better in terms of both caption quality and video retrieval. On BLEU, METEOR, ROUGE-L, and CIDEr, the English model scored 47.18, 30.46, 62.07, and 59.98, respectively, whereas the Arabic model scored 21.65, 36.30, 44.897, and 45.52, respectively. In the video retrieval evaluation, the English and Arabic models successfully retrieved 67% and 40% of the videos, respectively, at a 20% similarity threshold. These models have potential applications in storytelling, sports commentaries, and video surveillance.
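The retrieval evaluation described in the abstract — matching a query against generated captions by cosine similarity with a 20% threshold — can be sketched as follows. This is an illustrative bag-of-words version only: the paper does not specify the text representation it uses, and the `retrieve` helper and example captions are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query: str, captions: dict, threshold: float = 0.20) -> list:
    """Return IDs of videos whose generated caption reaches the similarity threshold.

    The 0.20 default mirrors the 20% similarity criterion from the abstract.
    """
    return [vid for vid, cap in captions.items()
            if cosine_similarity(query, cap) >= threshold]
```

For example, with `captions = {"v1": "a man is playing football", "v2": "a cat sleeps on a sofa"}`, the query `"man playing football"` would retrieve only `"v1"`.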
Pages: 24
Related papers
50 results in total
  • [21] Deep learning based, a new model for video captioning
    Department of Computer Engineering, Faculty of Engineering Gazi University, Ankara, Turkey
    1600, Science and Information Organization (11):
  • [22] From Video to Language: Survey of Video Captioning and Description
    Tang P.-J.
    Wang H.-L.
    Zidonghua Xuebao/Acta Automatica Sinica, 2022, 48 (02): 375 - 397
  • [23] An enhanced query model for soccer video retrieval using temporal relationships
    Chen, SC
    Shyu, ML
    Zhao, N
    ICDE 2005: 21ST INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2005, : 1133 - 1134
  • [24] Watch It Twice: Video Captioning with a Refocused Video Encoder
    Shi, Xiangxi
    Cai, Jianfei
    Joty, Shafiq
    Gu, Jiuxiang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 818 - 826
  • [25] Incorporating the Graph Representation of Video and Text into Video Captioning
    Lu, Min
    Li, Yuan
    2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 396 - 401
  • [26] A Review Of Video Captioning Methods
    Mahajan, Dewarthi
    Bhosale, Sakshi
    Nighot, Yash
    Tayal, Madhuri
    INTERNATIONAL JOURNAL OF NEXT-GENERATION COMPUTING, 2021, 12 (05): 708 - 715
  • [27] Video Captioning of Future Frames
    Hosseinzadeh, Mehrdad
    Wang, Yang
    2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2021), 2021, : 979 - 988
  • [28] Sequence in sequence for video captioning
    Wang, Huiyun
    Gao, Chongyang
    Han, Yahong
    PATTERN RECOGNITION LETTERS, 2020, 130: 327 - 334
  • [29] Video Captioning with Listwise Supervision
    Liu, Yuan
    Li, Xue
    Shi, Zhongchao
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4197 - 4203
  • [30] Streamlined Dense Video Captioning
    Mun, Jonghwan
    Yang, Linjie
    Ren, Zhou
    Xu, Ning
    Han, Bohyung
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 3581 - +