OSVidCap: A Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario

Cited by: 5
Authors
Inacio, Andrei De Souza [1 ,2 ]
Gutoski, Matheus [1 ]
Lazzaretti, Andre Eugenio [1 ]
Lopes, Heitor Silverio [1 ]
Affiliations
[1] Univ Tecnol Fed Parana, Grad Program Elect Engn & Ind Informat, BR-80230901 Curitiba, Parana, Brazil
[2] Fed Inst Santa Catarina, BR-89111009 Gaspar, SC, Brazil
Keywords
Videos; Task analysis; Visualization; Feature extraction; Deep learning; Training; Proposals; Video captioning; Open-set recognition
DOI
10.1109/ACCESS.2021.3116882
CLC Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Automatically understanding and describing the visual content of videos in natural language is a challenging task in computer vision. Existing approaches are often designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video. This work presents OSVidCap, a novel open-set video captioning framework that recognizes and describes concurrent known actions in natural language and deals with unknown ones. OSVidCap is based on the encoder-decoder framework and uses an object detection-and-tracking mechanism followed by a background-blurring method to focus on specific targets in a video. Additionally, we employ the TI3D network with the Extreme Value Machine (EVM), which learns representations and recognizes unknown actions. We evaluate the proposed approach on the benchmark ActivityNet Captions dataset. We also propose an enhanced version of the LIRIS human activity dataset that provides a description for each action, along with spatial, temporal, and caption annotations for previously unlabeled actions in the dataset - treated as unknown actions in our experiments. Experimental results show the method's effectiveness in recognizing and describing concurrent actions in natural language, as well as its strong ability to handle detected unknown activities. Based on these results, we believe the proposed approach can be helpful for many real-world applications, including human behavior analysis, safety monitoring, and surveillance.
Pages: 137029-137041
Page count: 13
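The abstract describes a detection-and-tracking step followed by background blurring so that the captioning model attends to one target at a time. Below is a minimal illustrative sketch of that idea in Python, assuming OpenCV and NumPy are available; the function name blur_outside_bbox, the bounding-box format, and the fixed Gaussian kernel size are assumptions made for illustration, not the authors' implementation.

import cv2
import numpy as np

def blur_outside_bbox(frame, bbox, ksize=(31, 31)):
    """Blur everything outside a tracked target's bounding box.

    frame: H x W x 3 BGR image (NumPy array).
    bbox:  (x, y, w, h) box of the tracked person/object, in pixels.
    """
    x, y, w, h = bbox
    # Start from a fully blurred copy of the frame, then paste the sharp
    # target region back so only the tracked actor stays in focus.
    blurred = cv2.GaussianBlur(frame, ksize, 0)
    blurred[y:y + h, x:x + w] = frame[y:y + h, x:x + w]
    return blurred

if __name__ == "__main__":
    # Stand-in frame; in practice this would be a decoded video frame and a
    # bounding box produced by the detector/tracker.
    frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
    focused = blur_outside_bbox(frame, bbox=(200, 120, 150, 300))
    print(focused.shape)  # (480, 640, 3)

Presumably, each tracked actor would yield its own focused view before feature extraction, which is what allows concurrent actions to be captioned independently in the framework the abstract outlines.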