Fast Retinomorphic Event-Driven Representations for Video Gameplay and Action Recognition

Cited by: 4

Authors:

Chen, Huaijin ^{[1,2]}

Liu, Wanjia ^{[3,4]}

Goel, Rishab ^{[5,6]}

Lua, Rhonald C. ^{[7]}

Mittal, Siddharth ^{[8,9]}

Huang, Yuzhong ^{[10,11]}

Veeraraghavan, Ashok ^{[1]}

Patel, Ankit B. ^{[7]}

Affiliations:

[1] Rice Univ, Dept Elect & Comp Engn, Houston, TX 77005 USA

[2] SenseBrain Technol LLC, San Jose, CA 95131 USA

[3] Rice Univ, Dept Comp Sci, Houston, TX 77005 USA

[4] Google Inc, Mountain View, CA 94043 USA

[5] Indian Inst Technol Delhi, New Delhi 110016, India

[6] Borealis AI, Montreal, PQ H2S 3H1, Canada

[7] Baylor Coll Med, Dept Neurosci, Houston, TX 77030 USA

[8] Indian Inst Technol Kanpur, Kanpur 208016, Uttar Pradesh, India

[9] Quadeye, Gurgaon 122009, India

[10] Olin Coll Engn, Needham, MA 02492 USA

[11] Kensho Technol, Cambridge, MA 02138 USA

Source:

IEEE TRANSACTIONS ON COMPUTATIONAL IMAGING | 2020 / Vol. 6

Funding:

National Science Foundation (USA);

Keywords:

Smart cameras; retina; real-time systems; streaming media; cells (biology); reinforcement learning; video signal processing; video; ON-CENTER; CELLS; CONTRAST;

DOI:

10.1109/TCI.2019.2948755

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Good temporal representations are crucial for video understanding, and the state-of-the-art video recognition framework is based on two-stream networks. In such a framework, besides the regular ConvNets responsible for RGB frame inputs, a second network is introduced to handle the temporal representation, usually the optical flow (OF). However, OF or other task-oriented flow is computationally costly, and is thus typically pre-computed. Critically, this prevents the two-stream approach from being applied to reinforcement learning (RL) applications such as video game playing, where the next state depends on the current state and action choices. Inspired by the early vision systems of mammals and insects, we propose a fast event-driven representation (EDR) that models several major properties of early retinal circuits: (1) logarithmic input response, (2) multi-timescale temporal smoothing to filter noise, and (3) bipolar (ON/OFF) pathways for primitive event detection. Trading off directional information for speed (>9000 fps), EDR enables fast real-time inference/learning in video applications that require interaction between an agent and the world, such as game-playing, virtual robotics, and domain adaptation. In this vein, we use EDR to demonstrate performance improvements over state-of-the-art reinforcement learning algorithms for Atari games, something that has not been possible with pre-computed OF. Moreover, with UCF-101 video action recognition experiments, we show that EDR performs near state-of-the-art in accuracy while achieving a 1,500x speedup in input representation processing, as compared to optical flow.
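
The three retinal properties listed in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `edr_frames`, the smoothing rates `alphas`, the event `threshold`, and the log offset `eps` are all hypothetical choices made for illustration, and the abstract does not specify how the paper parameterizes any of these steps. The sketch takes the log of each frame, tracks one exponential moving average per timescale, and splits the deviation from that average into rectified ON (brightening) and OFF (darkening) maps.

```python
import numpy as np


def edr_frames(frames, alphas=(0.5, 0.1), threshold=0.1, eps=1.0):
    """Toy event-driven representation of a grayscale video (illustrative only).

    frames: array of shape (T, H, W) with non-negative intensities.
    Returns an array of shape (T, 2 * K, H, W), K = len(alphas):
    channels [0, K) are ON maps, channels [K, 2K) are OFF maps.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # (1) Logarithmic input response; eps avoids log(0) (assumed offset).
    log_frames = np.log(frames + eps)

    k = len(alphas)
    a = np.asarray(alphas, dtype=np.float64).reshape(-1, 1, 1)
    # (2) One exponential moving average per timescale, started at frame 0.
    ema = np.repeat(log_frames[:1], k, axis=0)  # shape (K, H, W)

    out = np.zeros((frames.shape[0], 2 * k) + frames.shape[1:])
    for t, x in enumerate(log_frames):
        diff = x[None] - ema  # deviation from the smoothed past
        # (3) Bipolar pathways: keep only changes larger than the threshold.
        on = np.where(diff > threshold, diff, 0.0)
        off = np.where(diff < -threshold, -diff, 0.0)
        out[t] = np.concatenate([on, off], axis=0)
        ema = (1.0 - a) * ema + a * x[None]
    return out


# Usage: a 2x2 video in which one pixel brightens at t = 1 and stays bright.
frames = np.full((3, 2, 2), 10.0)
frames[1:, 0, 0] = 100.0
events = edr_frames(frames)
print(events.shape)  # (3, 4, 2, 2)
```

In this toy setup only the brightened pixel produces ON events, and the fast timescale (larger `alpha`) adapts sooner than the slow one, so its response fades first. Every step is a per-pixel operation with no motion search, which is consistent with the abstract's point that EDR gives up directional information in exchange for speed.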

Citation

Pages: 276-290

Page count: 15