Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Cited by: 80
Authors
Zeng, Wang [1]
Jin, Sheng [2, 3]
Liu, Wentao [3]
Qian, Chen [3]
Luo, Ping [2]
Ouyang, Wanli [4]
Wang, Xiaogang [1]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Univ Hong Kong, Hong Kong, Peoples R China
[3] SenseTime Res & Tetras AI, Hangzhou, Peoples R China
[4] Univ Sydney, Sydney, NSW, Australia
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Funding
Australian Research Council;
Keywords
DOI
10.1109/CVPR52688.2022.01082
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git.
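The idea of merging tokens by progressive clustering can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, so plain k-means is used here purely as a stand-in, and merged tokens are formed by simple feature averaging. The function names `merge_tokens` and `progressive_merge` and the parameter `stage_sizes` are hypothetical and are not taken from the TCFormer code base.

```python
import numpy as np


def merge_tokens(tokens, num_clusters, num_iters=10):
    """Cluster tokens with plain k-means and average each cluster into one token.

    Assumption: k-means is only a stand-in for the paper's clustering step.
    tokens: (N, C) array of token features.
    Returns (merged, assign): merged is (num_clusters, C), assign is (N,).
    """
    n = tokens.shape[0]
    # Deterministic initialisation: evenly spaced tokens act as initial centres.
    init_idx = np.linspace(0, n - 1, num_clusters).astype(int)
    centres = tokens[init_idx].astype(float)
    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every centre.
        dists = ((tokens[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:  # an empty cluster keeps its previous centre
                centres[k] = members.mean(axis=0)
    return centres, assign


def progressive_merge(tokens, stage_sizes):
    """Apply merge_tokens repeatedly, shrinking the token set stage by stage."""
    stages = [tokens]
    for size in stage_sizes:
        tokens, _ = merge_tokens(tokens, size)
        stages.append(tokens)
    return stages


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid_tokens = rng.normal(size=(64, 8))  # e.g. an 8x8 grid of 8-dim tokens
    for stage in progressive_merge(grid_tokens, [16, 4]):
        print(stage.shape)
```

Because cluster membership depends on feature similarity rather than grid position, a merged token in this sketch can cover an irregular set of locations, which is the property the abstract attributes to TCFormer's tokens.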
Pages: 11091-11101
Page count: 11