Active Learning for Video Classification with Frame Level Queries

Cited by: 3
Authors
Goswami, Debanjan [1]
Chakraborty, Shayok [1]
Affiliations
[1] Florida State University, Department of Computer Science, Tallahassee, FL 32306, USA
Source
2023 International Joint Conference on Neural Networks (IJCNN), 2023
Funding
U.S. National Science Foundation
Keywords
active learning; video classification; deep learning
DOI
10.1109/IJCNN54540.2023.10191348
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Deep learning algorithms have pushed the boundaries of computer vision research and have demonstrated commendable performance in a variety of applications. However, training a robust deep neural network necessitates a large amount of labeled training data, the acquisition of which involves significant time and human effort. This problem is even more serious for an application like video classification, where a human annotator has to watch an entire video end-to-end in order to furnish a label. Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data; this tremendously reduces the human annotation effort in inducing a machine learning model, as only the few samples identified by the algorithm need to be labeled manually. In this paper, we propose a novel active learning framework for video classification, with the goal of further reducing the labeling onus on the human annotators. Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video; the human annotator merely needs to review the frames and provide a label for each video. This involves much less manual work than watching each complete video to come up with a label. We formulate a criterion based on uncertainty and diversity to identify the informative videos, and exploit representative sampling techniques to extract a set of exemplar frames from each video. To the best of our knowledge, this is the first research effort to develop an active learning framework for video classification in which the annotators need to inspect only a few frames to produce a label, rather than watching the video end-to-end. Our extensive empirical analyses corroborate the potential of our method to substantially reduce human annotation effort in applications like video classification, where annotating a single data instance can be extremely tedious.
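The abstract describes the selection machinery only at a high level. Purely as an illustration, the sketch below combines a frame-entropy uncertainty score with a greedy diversity term to pick a batch of videos, and uses k-means clustering as one possible representative-sampling technique to extract exemplar frames from each selected video. Everything here is an assumption introduced for illustration: the function names, the multiplicative uncertainty-times-distance score, and the choice of k-means are not taken from the paper, whose actual criterion is defined in the full text.

```python
# Illustrative sketch only -- NOT the authors' formulation. Assumes a
# frame-level feature extractor and classifier have already produced
# `frame_feats` (num_frames x dim) and `frame_probs` (num_frames x classes)
# for every unlabeled video.
import numpy as np
from sklearn.cluster import KMeans

def video_uncertainty(frame_probs: np.ndarray) -> float:
    """Mean Shannon entropy of a video's frame-level class probabilities."""
    eps = 1e-12
    frame_entropies = -np.sum(frame_probs * np.log(frame_probs + eps), axis=1)
    return float(frame_entropies.mean())

def select_videos(video_feats: np.ndarray, video_probs: list, batch_size: int) -> list:
    """Greedy batch selection: seed with the most uncertain video, then
    repeatedly add the video maximizing uncertainty * distance-to-selected
    (one simple way to trade off uncertainty against diversity)."""
    uncertainty = np.array([video_uncertainty(p) for p in video_probs])
    selected = [int(np.argmax(uncertainty))]
    while len(selected) < batch_size:
        chosen = video_feats[selected]  # features of already-selected videos
        # distance from every candidate to its nearest already-selected video
        dists = np.linalg.norm(
            video_feats[:, None, :] - chosen[None, :, :], axis=2
        ).min(axis=1)
        score = uncertainty * dists
        score[selected] = -np.inf       # never re-pick a selected video
        selected.append(int(np.argmax(score)))
    return selected

def exemplar_frames(frame_feats: np.ndarray, k: int) -> list:
    """Representative sampling: cluster frames with k-means (k <= num_frames)
    and return the index of the frame nearest each cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10).fit(frame_feats)
    picks = {
        int(np.argmin(np.linalg.norm(frame_feats - c, axis=1)))
        for c in km.cluster_centers_
    }
    return sorted(picks)
```

Under this sketch, the annotator would be shown only the `exemplar_frames` of each video returned by `select_videos`. The multiplicative score is a stand-in; any rule that rewards uncertain, mutually distant videos fits the same template.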
Pages: 9