Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Times cited: 0
Authors
Yang, Zhibo [1 ,2 ]
Mondal, Sounak [1 ]
Ahn, Seoyoung [1 ]
Xue, Ruoyu [1 ]
Zelinsky, Gregory [1 ]
Minh Hoai [1 ,3 ]
Samaras, Dimitris [1 ]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
[2] Waymo LLC, Mountain View, CA 94043 USA
[3] VinAI Res, Hanoi, Vietnam
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024
Funding
National Science Foundation (USA)
Keywords
EYE-MOVEMENTS; ATTENTION; SEARCH;
DOI
10.1109/CVPR52733.2024.00166
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Most models of visual attention aim at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatiotemporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and taskless free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
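The abstract contrasts HAT's sequential dense prediction, which outputs a full-resolution heatmap for each fixation, with earlier scanpath models that quantize fixations to a coarse grid of cells. A minimal sketch of that distinction (not the authors' code; image size, grid size, and the random heatmap are illustrative assumptions):

```python
import numpy as np

# Hypothetical illustration: contrast a coarse fixation grid with a
# dense per-fixation heatmap, as described in the abstract.
H, W = 320, 512          # image size in pixels (assumed)
gh, gw = 10, 16          # coarse grid used by prior scanpath models (assumed)

rng = np.random.default_rng(0)
heatmap = rng.random((H, W))
heatmap /= heatmap.sum()          # dense fixation probability map

# Dense prediction: pick the next fixation directly in pixel space.
y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)

# Grid-based prediction: pool the map into cells, pick the best cell,
# then map its centre back to pixels. The fixation can only land on a
# cell centre, so the quantization error is up to half a cell.
cells = heatmap.reshape(gh, H // gh, gw, W // gw).sum(axis=(1, 3))
cy, cx = np.unravel_index(cells.argmax(), cells.shape)
gy = cy * (H // gh) + (H // gh) // 2
gx = cx * (W // gw) + (W // gw) // 2

print("dense fixation:", (y, x))
print("grid fixation: ", (gy, gx))
```

With a 10x16 grid over a 320x512 image, each cell spans 32x32 pixels, so grid-based prediction can miss the true fixation location by up to 16 pixels in each axis; a dense heatmap avoids this discretization loss entirely.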
Pages: 1683-1693 (11 pages)