TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation

Cited by: 42
Authors
Lu, Liqiang [1 ]
Guan, Naiqing [1 ,2 ]
Wang, Yuyue [1 ]
Jia, Liancheng [1 ]
Luo, Zizhang [1 ]
Yin, Jieming [3 ]
Cong, Jason [4 ]
Liang, Yun [1 ]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Univ Toronto, Toronto, ON, Canada
[3] Lehigh Univ, Bethlehem, PA 18015 USA
[4] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
Source
2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021) | 2021
Funding
Beijing Natural Science Foundation;
Keywords
DECOMPOSITIONS; ACCELERATOR;
DOI
10.1109/ISCA52012.2021.00062
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Accelerating tensor applications on spatial architectures provides high performance and energy efficiency, but requires accurate performance models for evaluating various dataflow alternatives. Such modeling relies on the notation of tensor dataflow and the formulation of performance metrics. Recently proposed compute-centric and data-centric notations describe the dataflow using imperative directives. However, these two notations are less expressive and thus lead to limited optimization opportunities and inaccurate performance models. In this paper, we propose TENET, a framework that models the hardware dataflow of tensor applications. We start by introducing a relation-centric notation, which formally describes the hardware dataflow for tensor computation. The relation-centric notation specifies the hardware dataflow, PE interconnection, and data assignment in a uniform manner using relations. It is more expressive than the compute-centric and data-centric notations because it admits more sophisticated affine transformations. Another advantage of the relation-centric notation is that it inherently supports accurate metric estimation, including data reuse, bandwidth, latency, and energy. TENET computes each performance metric by counting the relations using integer set structures and operators. Overall, TENET achieves 37.4% and 51.4% latency reduction for CONV and GEMM kernels, respectively, compared with the state-of-the-art data-centric notation, by identifying more sophisticated hardware dataflows.
Pages: 720-733
Number of pages: 14
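
The abstract's key mechanism, expressing the dataflow, PE interconnection, and data assignment as integer relations and deriving metrics by counting them, can be illustrated with a short sketch. What follows is a minimal illustration, not TENET's actual implementation: it assumes the islpy Python bindings for the Integer Set Library (ISL), and the GEMM size, relation names, and space-time mapping are hypothetical.

    # Minimal sketch (not TENET's code) of relation-centric dataflow modeling:
    # statement instances, a space-time mapping, and a data-access relation are
    # all integer relations; metrics fall out of counting set sizes with ISL.
    # Assumes: pip install islpy
    import islpy as isl

    # All statement instances of a 4x4x4 GEMM: C[i][j] += A[i][k] * B[k][j]
    domain = isl.Set("{ S[i,j,k] : 0 <= i,j,k < 4 }")

    # Hypothetical dataflow: instance S[i,j,k] runs on PE (i,j) at time step k.
    space = isl.Map("{ S[i,j,k] -> PE[i,j] }").intersect_domain(domain)
    time = isl.Map("{ S[i,j,k] -> T[k] }").intersect_domain(domain)

    # Data assignment: which element of tensor A each instance reads.
    reads_A = isl.Map("{ S[i,j,k] -> A[i,k] }").intersect_domain(domain)

    num_pes = space.range().count_val()    # 16 PEs are used
    latency = time.range().count_val()     # 4 distinct time steps
    accesses = domain.count_val()          # 64 reads of A in total
    unique = reads_A.range().count_val()   # only 16 distinct A elements
    print(num_pes, latency, accesses, unique)

Counting the image of the access relation against the full instance set yields the reuse factor for A directly (64 / 16 = 4); the paper's bandwidth, latency, and energy metrics are derived from the same kind of set-counting operators.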