Joint learning of images and videos with a single Vision Transformer

Cited by: 0

Authors:

Shimizu, Shuki [1]

Tamaki, Toru [1]

Affiliations:

[1] Nagoya Inst Technol, Nagoya, Japan

Source:

2023 18th International Conference on Machine Vision and Applications (MVA) | 2023

DOI:

10.23919/MVA57639.2023.10215661

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this study, we propose a method for jointly learning images and videos with a single model. Images and videos are usually handled by separate models. The proposed method, IV-ViT, feeds a batch of images to a Vision Transformer and also accepts a set of video frames, which are aggregated temporally by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
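
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of sharing one frame-level model between images and videos. It assumes that late fusion means averaging per-frame class scores over time, which the abstract does not state. The names `toy_frame_model`, `predict_images`, and `predict_videos` are hypothetical, and `toy_frame_model` is a trivial stand-in for a Vision Transformer, not the authors' IV-ViT.

```python
from typing import Callable, List

Frame = List[float]    # flattened image or video frame (toy stand-in for a pixel tensor)
Scores = List[float]   # one score per class


def toy_frame_model(frame: Frame) -> Scores:
    """Hypothetical stand-in for a ViT: maps one frame to two class scores."""
    total = sum(frame)
    return [total, -total]


def predict_images(model: Callable[[Frame], Scores],
                   images: List[Frame]) -> List[Scores]:
    """Image path: each image is scored independently by the shared model."""
    return [model(image) for image in images]


def predict_videos(model: Callable[[Frame], Scores],
                   videos: List[List[Frame]]) -> List[Scores]:
    """Video path: score every frame with the same model, then fuse late.

    Assumption: late fusion is a plain mean of per-frame scores over time.
    """
    fused = []
    for clip in videos:
        per_frame = [model(frame) for frame in clip]
        num_frames = len(per_frame)
        # zip(*per_frame) groups the scores of each class across all frames.
        fused.append([sum(cls) / num_frames for cls in zip(*per_frame)])
    return fused


if __name__ == "__main__":
    images = [[1.0, 2.0], [0.5, 0.5]]
    videos = [[[1.0, 1.0], [3.0, 1.0]]]  # one clip with two frames
    print(predict_images(toy_frame_model, images))
    print(predict_videos(toy_frame_model, videos))
```

Because both paths call the same `model`, any parameters it holds would be shared between image and video inputs; only the temporal averaging step is specific to video.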


Pages: 6
