A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers

Cited by: 15
Authors
Phothilimthana, Phitchaya Mangpo [1 ]
Sabne, Amit [1 ]
Sarda, Nikhil [1 ]
Murthy, Karthik Srinivasa [1 ]
Zhou, Yanqi [1 ]
Angermueller, Christof [1 ]
Burrows, Mike [1 ]
Roy, Sudip [1 ]
Mandke, Ketan [1 ]
Farahani, Rezsa [1 ]
Wang, Yu Emma [1 ]
Ilbeyi, Berkin [1 ]
Hechtman, Blake [1 ]
Roune, Bjarke [1 ]
Wang, Shen [1 ]
Xu, Yuanzhong [1 ]
Kaufman, Samuel J. [2 ]
Affiliations
[1] Google, Mountain View, CA 94043 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
30th International Conference on Parallel Architectures and Compilation Techniques (PACT 2021) | 2021
Keywords
compiler; autotuning; machine learning
DOI
10.1109/PACT52795.2021.00008
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
Search-based techniques have been demonstrated to be effective in solving the complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by two limitations. First, prior work requires factoring a computation graph into smaller subgraphs over which search is applied; this decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, prior work applies search at a single stage of the compilation flow, which does not fit the multi-stage, layered architecture of most production ML compilers. This paper presents XTAT, an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations across multiple compilation stages. XTAT applies XTAT-M, a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. Using various search strategies, XTAT tunes tensor layouts, operator fusion decisions, tile sizes, and code-generation parameters in XLA, a production ML compiler. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs) at Google, XTAT achieves execution-time speedups over the heavily optimized XLA compiler of up to 2.4x, and 5% on average.
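To make the joint, multi-stage formulation concrete, the following is a minimal Python sketch of the idea, not XTAT's implementation: the knob names, the synthetic cost model, and the use of plain random search are all illustrative assumptions. Its point is that knobs from different compiler passes must be searched together, because the payoff of one stage's decision (here, fusion) depends on another's (tile size).

```python
import random

# Hypothetical per-stage knobs, loosely mirroring what the abstract says
# XTAT tunes in XLA; real search spaces are vastly larger.
SEARCH_SPACE = {
    "layout": ["NCHW", "NHWC"],                # graph-level: tensor layout
    "fusion": ["aggressive", "conservative"],  # graph-level: fusion policy
    "tile_rows": [8, 16, 32],                  # subgraph-level: tile size
    "unroll": [1, 2, 4],                       # code-generation parameter
}

def compile_and_measure(config):
    """Stand-in for compiling a model with `config` applied at the matching
    compiler stages and timing it on hardware. Returns a synthetic cost so
    the sketch runs end to end; a real tuner would build with the compiler
    and benchmark the binary."""
    cost = 10.0
    if config["layout"] == "NHWC":
        cost -= 1.0
    # A toy cross-pass interaction: aggressive fusion helps only when the
    # tiles are large enough, which is why stages must be tuned jointly.
    if config["fusion"] == "aggressive" and config["tile_rows"] >= 16:
        cost -= 2.0
    return cost + random.uniform(0.0, 0.5)  # measurement noise

def random_search(budget=50):
    """Samples all stages' knobs together, so every measurement reflects
    interactions between passes rather than tuning each stage in isolation."""
    best_cfg, best_cost = None, float("inf")
    for _ in range(budget):
        cfg = {knob: random.choice(vals) for knob, vals in SEARCH_SPACE.items()}
        cost = compile_and_measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

if __name__ == "__main__":
    cfg, cost = random_search()
    print(f"best config: {cfg}  estimated time: {cost:.2f} ms")
```

In the paper's setting, random search is only one of several strategies, and measurements come from compiling with XLA and running on TPUs rather than from a synthetic cost function.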
Pages: 1-16
Page count: 16