A Framework for Memory Oversubscription Management in Graphics Processing Units

Cited by: 52
Authors
Li, Chen [1 ,2 ]
Ausavarungnirun, Rachata [3 ,6 ]
Rossbach, Christopher J. [4 ,5 ]
Zhang, Youtao [2 ]
Mutlu, Onur [3 ,7 ]
Guo, Yang [1 ]
Yang, Jun [2 ]
Affiliations
[1] Natl Univ Def Technol, Changsha, Peoples R China
[2] Univ Pittsburgh, Pittsburgh, PA 15260 USA
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[4] Univ Texas Austin, Austin, TX 78712 USA
[5] VMware Res, Jersey City, NJ USA
[6] King Mongkuts Univ Technol North Bangkok, Bangkok, Thailand
[7] Swiss Fed Inst Technol, Zurich, Switzerland
Source
TWENTY-FOURTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXIV) | 2019
Funding
US National Science Foundation; National Natural Science Foundation of China;
Keywords
graphics processing units; GPGPU applications; virtual memory management; oversubscription;
DOI
10.1145/3297858.3304044
Chinese Library Classification
TP3 [Computing technology; computer technology];
Discipline code
0812 ;
Abstract
Modern discrete GPUs support unified memory and demand paging. Automatic management of data movement between CPU memory and GPU memory dramatically reduces developer effort. However, when an application's working set exceeds physical memory capacity, the resulting data movement can cause severe performance loss. This paper proposes a memory management framework, called ETC, that transparently improves GPU performance under memory oversubscription using new techniques to overlap the eviction latency of GPU pages, reduce thrashing cost, and increase effective memory capacity. Eviction latency can be hidden by eagerly creating space for demand-paged data with proactive eviction (E). Thrashing costs can be ameliorated with memory-aware throttling (T), which dynamically reduces GPU parallelism when page fault frequencies become high. Capacity compression (C) can enable larger working sets without increasing physical memory capacity. No single technique fits all workloads, and, thus, ETC integrates proactive eviction, memory-aware throttling, and capacity compression into a principled framework that dynamically selects the most effective combination of techniques, transparently to the running software. To this end, ETC classifies applications into three categories: regular applications without data sharing across kernels, regular applications with data sharing across kernels, and irregular applications. Our evaluation shows that ETC fully mitigates the oversubscription overhead for regular applications without data sharing and delivers performance similar to the ideal unlimited-GPU-memory baseline. We also show that ETC outperforms the state-of-the-art baseline by 60.4% and 270% for regular applications with data sharing and irregular applications, respectively.
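The abstract describes ETC as classifying each application into one of three categories and then enabling a combination of its three techniques (E, T, C). The following is a minimal illustrative sketch of such a dispatch, not code from the paper's artifact; the category names follow the abstract, while the specific technique combinations assigned per category are assumptions made here for illustration.

```python
from enum import Enum, auto

class Category(Enum):
    """Application categories named in the ETC abstract."""
    REGULAR_NO_SHARING = auto()  # regular, no data sharing across kernels
    REGULAR_SHARING = auto()     # regular, data shared across kernels
    IRREGULAR = auto()           # irregular memory access patterns

def select_techniques(category: Category) -> set:
    """Return the set of ETC techniques to enable for a category.
    The mappings below are illustrative assumptions, not the paper's
    exact policy."""
    if category is Category.REGULAR_NO_SHARING:
        # Eviction latency can be fully hidden; compression adds capacity.
        return {"proactive_eviction", "capacity_compression"}
    if category is Category.REGULAR_SHARING:
        # Shared data raises thrashing risk, so throttling is also enabled.
        return {"proactive_eviction", "memory_aware_throttling",
                "capacity_compression"}
    # Irregular applications: limit parallelism to curb page-fault storms.
    return {"memory_aware_throttling", "capacity_compression"}

print(sorted(select_techniques(Category.IRREGULAR)))
```

The point of the sketch is only the structure the abstract implies: a per-application classification step followed by a transparent, category-driven choice of technique combination.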
Pages: 49-63
Page count: 15