Future Feature-Based Supervised Contrastive Learning for Streaming Perception

Cited by: 1
Authors
Wang, Tongbo [1 ]
Huang, Hua [2 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[2] Beijing Normal Univ, Sch Artificial Intelligence, Beijing 100875, Peoples R China
Keywords
Streaming media; Object detection; Contrastive learning; Feature extraction; Accuracy; Task analysis; Real-time systems; Video object detection; streaming perception; supervised contrastive learning; appearance features
DOI
10.1109/TCSVT.2024.3439692
CLC numbers
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject classification codes
0808; 0809
Abstract
Streaming perception, a critical task in computer vision, involves predicting object locations in a video stream in real time from prior frames. Current methods such as StreamYOLO rely mainly on coordinate information and often fall short of precise predictions because of feature misalignment between the input data and the supervisory labels. This paper introduces a novel method, Future Feature-based Supervised Contrastive Learning (FFSCL), which addresses this challenge by incorporating appearance features from future frames and leveraging supervised contrastive learning. FFSCL establishes a robust correspondence between an object's appearance in the current and past frames and its location in the subsequent frame, which significantly improves the accuracy of object position prediction in streaming perception tasks. In addition, FFSCL includes a sample pair construction (SPC) module that efficiently builds positive and negative samples from future-frame labels, and a feature consistency loss (FCL) that strengthens the supervised contrastive objective by linking appearance features from future frames with those from past frames. Extensive experiments on two large-scale benchmark datasets show that FFSCL consistently outperforms state-of-the-art methods on streaming perception tasks. This study advances the integration of supervised contrastive learning and future-frame information into streaming perception, paving the way for more accurate and efficient object prediction in video streams.
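This record contains no code; purely as illustration, the following is a minimal PyTorch sketch of a generic SupCon-style supervised contrastive loss of the kind the abstract invokes, in which positives are embeddings that share an identity label (in FFSCL's setting, such labels would plausibly be derived from future-frame annotations). Every name below (supervised_contrastive_loss, the temperature default, the toy track ids) is a hypothetical stand-in, not the authors' SPC module or FCL loss.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive_loss(features, labels, temperature=0.07):
        # Hypothetical SupCon-style loss, NOT the paper's FFSCL/FCL code.
        # features: (N, D) embeddings, e.g. pooled appearance features.
        # labels:   (N,) identity ids (here imagined as object-track ids
        #           taken from future-frame annotations).
        features = F.normalize(features, dim=1)    # unit-length rows
        sim = features @ features.T / temperature  # (N, N) scaled cosine sims
        n = features.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=features.device)
        sim = sim.masked_fill(eye, float('-inf'))  # an anchor never matches itself
        # Positives: distinct samples sharing a label.
        pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # Zero out non-positive slots so the -inf diagonal cannot poison the
        # sum, then average the negative log-likelihood over each anchor's
        # positives.
        per_anchor = -(log_prob.masked_fill(~pos, 0.0).sum(dim=1)) \
            / pos.sum(dim=1).clamp(min=1)
        valid = pos.any(dim=1)  # keep only anchors with at least one positive
        return per_anchor[valid].mean()

    # Toy usage: six embeddings covering three object tracks, two views each.
    feats = torch.randn(6, 128)
    track_ids = torch.tensor([0, 0, 1, 1, 2, 2])
    print(supervised_contrastive_loss(feats, track_ids))

The diagonal masking is the one subtle point: setting self-similarity to -inf removes each anchor from its own softmax denominator, matching the standard SupCon formulation.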
Pages: 13611-13625
Page count: 15