AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction

Cited by: 35
Authors
Zheng, Size [1 ]
Chen, Renze [1 ]
Wei, Anjiang [2 ]
Jin, Yicheng [2 ]
Han, Qin [2 ]
Lu, Liqiang [2 ]
Wu, Bingyang [2 ]
Li, Xiuhong [3 ,4 ]
Yan, Shengen [3 ]
Liang, Yun [1 ]
Affiliations
[1] Peking University, Beijing, China
[2] Stanford University, Stanford, CA, USA
[3] SenseTime Research, Beijing, China
[4] Shanghai Lab, Shanghai, China
Source
Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22), 2022
Funding
National Natural Science Foundation of China;
Keywords
spatial accelerators; code generation; mapping; tensor computations;
DOI
10.1145/3470496.3527440
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Hardware specialization is a promising trend to sustain performance growth. Spatial hardware accelerators that employ specialized and hierarchical computation and memory resources have recently shown high performance gains for tensor applications such as deep learning, scientific computing, and data mining. To harness the power of these accelerators, programmers have to use specialized instructions with certain hardware constraints. However, these accelerators and instructions are quite new, and there is a lack of understanding of the hardware abstraction, the performance optimization space, and automatic methodologies to explore that space. Existing compilers use hand-tuned computation implementations and optimization templates, resulting in sub-optimal performance and heavy development costs. In this paper, we propose AMOS, an automatic compilation framework for spatial hardware accelerators. Central to this framework is a hardware abstraction that not only clearly specifies the behavior of spatial hardware instructions, but also formally defines the mapping problem from software to hardware. Based on this abstraction, we develop algorithms and performance models to explore various mappings automatically. Finally, we build a compilation framework that uses the hardware abstraction as its compiler intermediate representation (IR), explores both compute mappings and memory mappings, and generates high-performance code for different hardware backends. Our experiments show that AMOS achieves more than 2.50x speedup over hand-optimized libraries on Tensor Cores, 1.37x speedup over TVM on Intel CPU vector units with AVX-512, and up to 25.04x speedup over AutoTVM on the dot units of Mali GPUs. The source code of AMOS is publicly available.
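
The abstract's core technical idea is a formal mapping from the iteration space of a software tensor computation onto the iterators of a spatial hardware instruction, with candidate mappings ranked by a performance model. The Python sketch below is a minimal, purely illustrative rendering of that idea, not AMOS's actual IR or API: the intrinsic shape (16x16x8), the names INTRINSIC_DIMS, enumerate_mappings, and cost, and the cost function itself are assumptions made for this example.

# Illustrative sketch only (not AMOS's real API): assign the loop iterators of a
# tensor computation to the iterators of a spatial hardware instruction and
# pick the assignment a simple cost model prefers.
from itertools import permutations
from math import ceil

# Hypothetical spatial instruction: a 16x16x8 matrix-multiply-accumulate tile,
# loosely inspired by Tensor Core fragments (shape chosen for illustration).
INTRINSIC_DIMS = {"i": 16, "j": 16, "k": 8}

def enumerate_mappings(software_iters):
    """Yield every bijective assignment of intrinsic iterators to software iterators."""
    for perm in permutations(software_iters):
        yield dict(zip(INTRINSIC_DIMS, perm))

def cost(mapping, extents):
    """Toy cost model: number of intrinsic invocations plus a small padding penalty."""
    calls, waste = 1, 0
    for hw_iter, sw_iter in mapping.items():
        tiles = ceil(extents[sw_iter] / INTRINSIC_DIMS[hw_iter])
        calls *= tiles
        waste += tiles * INTRINSIC_DIMS[hw_iter] - extents[sw_iter]
    return calls + 0.01 * waste

# Software view: C[m, n] += A[m, k] * B[k, n] with these loop extents.
extents = {"m": 512, "n": 1000, "k": 64}

best = min(enumerate_mappings(extents), key=lambda m: cost(m, extents))
print("best mapping (intrinsic iter -> software iter):", best)
print("cost:", cost(best, extents))

In AMOS itself the explored space also covers memory mappings and relies on the performance models mentioned in the abstract rather than this toy cost function; the sketch only shows why the choice of iterator mapping changes the number of intrinsic invocations and the padding overhead.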
Pages: 874-887
Page count: 14