Egocentric Early Action Prediction via Multimodal Transformer-Based Dual Action Prediction

Cited by: 13
Authors
Guan, Weili [1 ,4 ]
Song, Xuemeng [2 ]
Wang, Kejie [2 ]
Wen, Haokun [3 ]
Ni, Hongda [3 ]
Wang, Yaowei [4 ]
Chang, Xiaojun [5 ]
Affiliations
[1] Monash Univ, Dept Data Sci & Artificial Intelligence, Clayton, Vic 3800, Australia
[2] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266000, Peoples R China
[3] Harbin Inst Technol Shenzhen, Sch Comp Sci & Technol, Shenzhen 518055, Peoples R China
[4] AI Res Ctr, Peng Cheng Lab, Shenzhen 518055, Peoples R China
[5] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, NSW 2007, Australia
Funding
Australian Research Council;
Keywords
Transformers; Predictive models; Task analysis; Correlation; Computational modeling; Encoding; Training; Egocentric early action prediction; transformer; mutual enhancement; ACTION RECOGNITION; ATTENTION;
DOI
10.1109/TCSVT.2023.3248271
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808; 0809;
Abstract
Egocentric early action prediction, which aims to recognize the ongoing action in a first-person-view video as early as possible, before the action is fully executed, is a new yet challenging task due to the limited partial video input. Pioneering studies solved this task with LSTMs as the backbone and simply compiled the observed and unobserved video segments into single vectors, and hence suffer from two key limitations: they lack non-sequential relation modeling within the video snippet sequence, and they lack correlation modeling between the observed and unobserved video segments. To address these two limitations, in this paper we propose a novel multimodal TransfoRmer-based duAl aCtion prEdiction (mTRACE) model for egocentric early action prediction, which consists of two key modules: the early (observed) segment action prediction module and the future (unobserved) segment action prediction module. Both modules take Transformer encoders as the backbone to encode all potential relations among the input video snippets, and involve several single-modal and multi-modal classifiers for comprehensive supervision. Different from previous work, each of the two modules outputs two multi-modal feature vectors: one encoding the current input video segment, and the other predicting the missing video segment. For optimization, we design a two-stage training scheme comprising a mutual enhancement stage and an end-to-end aggregation stage. The former alternately optimizes the two action prediction modules, where the correlation between the observed and unobserved video segments is modeled with a consistency regularizer, while the latter seamlessly aggregates the two modules to fully exploit their combined capacity. Extensive experiments have demonstrated the superiority of our proposed model.
We have released the codes and the corresponding parameters to benefit other researchers at https://trace729.wixsite.com/trace.
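The dual-module idea in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the attention, pooling, projection weights, and the exact form of the consistency regularizer are all simplifying assumptions made here for illustration. Each module encodes its own segment with self-attention and emits two vectors — one summarizing the given segment, one guessing the missing segment — and the regularizer ties each module's guess to the other module's summary:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over a snippet sequence X of shape (T, d);
    # a stand-in for the Transformer-encoder backbone described in the abstract.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

class SegmentPredictor:
    """One prediction module (hypothetical): encodes its input segment and outputs
    two feature vectors -- one for the current segment, one for the missing segment."""
    def __init__(self, d, rng):
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        self.Wcur = rng.standard_normal((d, d)) / np.sqrt(d)    # projects the "current segment" vector
        self.Wmiss = rng.standard_normal((d, d)) / np.sqrt(d)   # projects the "missing segment" guess

    def __call__(self, X):
        H = self_attention(X, self.Wq, self.Wk, self.Wv)
        pooled = H.mean(axis=0)                                 # mean-pool snippets into one vector
        return pooled @ self.Wcur, pooled @ self.Wmiss

def consistency_loss(obs_out, fut_out):
    # One plausible consistency regularizer: the early module's guess for the unobserved
    # segment should match the future module's encoding of it, and vice versa.
    (obs_cur, obs_miss), (fut_cur, fut_miss) = obs_out, fut_out
    return np.mean((obs_miss - fut_cur) ** 2) + np.mean((fut_miss - obs_cur) ** 2)

rng = np.random.default_rng(0)
d = 16
observed = rng.standard_normal((5, d))     # 5 observed video snippet features
unobserved = rng.standard_normal((3, d))   # 3 future snippet features (available only at training time)
early, future = SegmentPredictor(d, rng), SegmentPredictor(d, rng)
loss = consistency_loss(early(observed), future(unobserved))
```

At inference time only the early module and the observed segment would be used; the unobserved segment (and hence the regularizer) is available only during the mutual enhancement training stage.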
Pages: 4472-4483
Page count: 12