Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Times cited: 0
Authors
Yang, Zhibo [1 ,2 ]
Mondal, Sounak [1 ]
Ahn, Seoyoung [1 ]
Xue, Ruoyu [1 ]
Zelinsky, Gregory [1 ]
Minh Hoai [1 ,3 ]
Samaras, Dimitris [1 ]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
[2] Waymo LLC, Mountain View, CA 94043 USA
[3] VinAI Res, Hanoi, Vietnam
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024
Funding
National Science Foundation (USA)
Keywords
EYE-MOVEMENTS; ATTENTION; SEARCH;
DOI
10.1109/CVPR52733.2024.00166
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Most models of visual attention aim at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatiotemporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and taskless free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
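The abstract contrasts HAT's sequential dense prediction, which outputs a full-resolution heatmap for each fixation, with earlier scanpath models that quantize fixations to a coarse grid of cells. A minimal sketch of that distinction (not the authors' code; image size, grid size, and the random heatmap are illustrative assumptions):

```python
import numpy as np

# Hypothetical illustration: contrast a coarse fixation grid with a
# dense per-fixation heatmap, as described in the abstract.
H, W = 320, 512          # image size in pixels (assumed)
gh, gw = 10, 16          # coarse grid used by prior scanpath models (assumed)

rng = np.random.default_rng(0)
heatmap = rng.random((H, W))
heatmap /= heatmap.sum()          # dense fixation probability map

# Dense prediction: pick the next fixation directly in pixel space.
y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)

# Grid-based prediction: pool the map into cells, pick the best cell,
# then map its centre back to pixels. The fixation can only land on a
# cell centre, so the quantization error is up to half a cell.
cells = heatmap.reshape(gh, H // gh, gw, W // gw).sum(axis=(1, 3))
cy, cx = np.unravel_index(cells.argmax(), cells.shape)
gy = cy * (H // gh) + (H // gh) // 2
gx = cx * (W // gw) + (W // gw) // 2

print("dense fixation:", (y, x))
print("grid fixation: ", (gy, gx))
```

With a 10x16 grid over a 320x512 image, each cell spans 32x32 pixels, so grid-based prediction can miss the true fixation location by up to 16 pixels in each axis; a dense heatmap avoids this discretization loss entirely.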
Pages: 1683-1693 (11 pages)