Action Recognition and Benchmark Using Event Cameras

Cited by: 6
Authors
Gao, Yue [1]
Lu, Jiaxuan [1]
Li, Siqi [1]
Ma, Nan [2]
Du, Shaoyi [3,4]
Li, Yipeng [5]
Dai, Qionghai [5]
Affiliations
[1] Tsinghua Univ, Sch Software, BNRist, THUIBCS, KLISS, BLBCI, Beijing 100084, Peoples R China
[2] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Beijing 100124, Peoples R China
[3] Xi An Jiao Tong Univ, Natl Engn Res Ctr Visual Informat & Applicat, Natl Key Lab Human Machine Hybrid Augmented Intel, Xian 710049, Peoples R China
[4] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian 710049, Peoples R China
[5] Tsinghua Univ, BNRist, THUIBCS, Dept Automat, BLBCI, Beijing 100084, Peoples R China
Keywords
Action recognition; dynamic vision sensor; event camera; event representation; vision
DOI
10.1109/TPAMI.2023.3300741
CLC classification number
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed remarkable achievements in video-based action recognition. Unlike traditional frame-based cameras, event cameras are bio-inspired vision sensors that record only pixel-wise brightness changes rather than absolute brightness values. However, little effort has been devoted to event-based action recognition, and large-scale public datasets are nearly unavailable. In this paper, we propose an event-based action recognition framework called EV-ACT. We first propose the Learnable Multi-Fused Representation (LMFR) to integrate multiple types of event information in a learnable manner. The LMFR, built with dual temporal granularity, is fed into an event-based slow-fast network to fuse appearance and motion features. A spatial-temporal attention mechanism is introduced to further enhance the learning capability for action recognition. To promote research in this direction, we have collected the largest event-based action recognition benchmark, named THUE-ACT-50, together with the accompanying THUE-ACT-50-CHL dataset captured under challenging environments; the two comprise over 12,830 recordings across 50 action categories, more than 4 times the size of the previous largest dataset. Experimental results show that the proposed framework achieves improvements of over 14.5%, 7.6%, 11.2%, and 7.4% compared to previous works on four benchmarks. We have also deployed the proposed EV-ACT framework on a mobile platform to validate its practicality and efficiency.
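As a rough illustration of the event data such a framework operates on, the sketch below accumulates a stream of (x, y, timestamp, polarity) events into a simple two-channel count image. This is only a generic baseline representation commonly used with event cameras, not the learnable LMFR proposed in the paper; the events_to_frame helper, the synthetic random events, and the 346 x 260 sensor resolution are assumptions made purely for the example.

import numpy as np

def events_to_frame(events, height, width):
    # Accumulate (x, y, t, polarity) events into a two-channel count frame:
    # channel 0 counts positive-polarity events, channel 1 counts negative ones.
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in events:
        channel = 0 if p > 0 else 1
        frame[channel, int(y), int(x)] += 1.0
    return frame

# Synthetic example: 1000 random events on an assumed 346 x 260 sensor.
rng = np.random.default_rng(0)
n = 1000
events = np.stack([
    rng.integers(0, 346, n),   # x coordinates
    rng.integers(0, 260, n),   # y coordinates
    np.sort(rng.random(n)),    # normalized timestamps
    rng.choice([-1, 1], n),    # polarity
], axis=1)
frame = events_to_frame(events, height=260, width=346)
print(frame.shape)  # (2, 260, 346)

In practice, a stack of such frames (or a learned representation like the LMFR) over a recording would be the input to the downstream recognition network.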
Pages: 14081-14097
Number of pages: 17
Related References
58 records in total
[1] Almatrafi, Mohammed; Baldwin, Raymond; Aizawa, Kiyoharu; Hirakawa, Keigo. Distance Surface for Event-Based Optical Flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(7): 1547-1556.
[2] Almatrafi, Mohammed; Hirakawa, Keigo. DAViS Camera Optical Flow. IEEE Transactions on Computational Imaging, 2020, 6: 396-407.
[3] Baldwin, R. Wes; Liu, Ruixu; Almatrafi, Mohammed; Asari, Vijayan; Hirakawa, Keigo. Time-Ordered Recent Event (TORE) Volumes for Event Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(2): 2519-2532.
[4] Benosman, Ryad; Clercq, Charles; Lagorce, Xavier; Ieng, Sio-Hoi; Bartolozzi, Chiara. Event-Based Visual Flow. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(2): 407-417.
[5] Berner, Raphael. 2013 Symposium on VLSI Circuits, 2013: C186.
[6] Brandli, Christian; Berner, Raphael; Yang, Minhao; Liu, Shih-Chii; Delbruck, Tobi. A 240 x 180 130 dB 3 μs Latency Global Shutter Spatiotemporal Vision Sensor. IEEE Journal of Solid-State Circuits, 2014, 49(10): 2333-2341.
[7] Brandli, C. IEEE International Symposium on Circuits and Systems, 2014: 686. DOI: 10.1109/ISCAS.2014.6865228.
[8] Calabrese, Enrico; Taverni, Gemma; Easthope, Christopher Awai; Skriabine, Sophie; Corradi, Federico; Longinotti, Luca; Eng, Kynan; Delbruck, Tobi. DHP19: Dynamic Vision Sensor 3D Human Pose Dataset. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019: 1695-1704.
[9] Carreira, Joao; Zisserman, Andrew. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 4724-4733.
[10] Chen, Shoushun; Guo, Menghan. Live Demonstration: CeleX-V: A 1M Pixel Multi-Mode Event-Based Sensor. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019: 1682-1683.