Efficient fine-grained shared buffer management for multiple OpenCL devices

Cited by: 1

Authors:

Xun, Chang-qing [1,2]

Chen, Dong [1,2]

Lan, Qiang [1,2]

Zhang, Chun-yuan [1,2]

Affiliations:

[1] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China

[2] Natl Univ Def Technol, State Key Lab High Performance Comp, Changsha 410073, Hunan, Peoples R China

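Source:

JOURNAL OF ZHEJIANG UNIVERSITY-SCIENCE C-COMPUTERS & ELECTRONICS | 2013 / Vol. 14 / No. 11

Funding:

National Natural Science Foundation of China; Specialized Research Fund for the Doctoral Program of Higher Education;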

Keywords:

Shared buffer; OpenCL; Heterogeneous programming; Fine grained; CPU; GPU;

DOI:

10.1631/jzus.C1300078

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are required to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between devices and managing data transfer among them. These tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid (MSI) cache coherency protocol is implemented for DSOM to keep the cache buffers coherent. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that our method for buffer access range analysis is widely applicable and that DSOM is efficient.
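
The coherency idea named in the abstract can be pictured with a small state tracker. The sketch below is only an illustration of a generic modified/shared/invalid scheme under stated assumptions, not DSOM's implementation: the class `SharedBuffer`, its methods, and the device names are hypothetical, the whole buffer is tracked as one unit (the paper works at a finer grain using access ranges), and transfers are merely logged rather than issued through OpenCL.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class SharedBuffer:
    """Tracks the MSI state of each device's cached copy of one buffer.

    Hypothetical illustration: the master copy lives in system memory
    ("host") and each device copy is a software-managed cache entry.
    """

    def __init__(self, devices):
        # Only the host copy holds data at first, so every device
        # cache entry starts out invalid.
        self.state = {dev: State.INVALID for dev in devices}
        self.transfers = []  # log of (source, destination) copies

    def _write_back(self, owner):
        # Flush a modified device copy to the host copy.
        self.transfers.append((owner, "host"))
        self.state[owner] = State.SHARED

    def read(self, dev):
        """Make dev's copy valid before a kernel reads the buffer."""
        if self.state[dev] is not State.INVALID:
            return  # cache hit: the device copy is already valid
        for other, st in self.state.items():
            if st is State.MODIFIED:
                self._write_back(other)  # host copy was stale
        self.transfers.append(("host", dev))
        self.state[dev] = State.SHARED

    def write(self, dev):
        """Give dev exclusive ownership before a kernel writes the buffer."""
        # Fetch current data first, since a kernel may write only part of it.
        self.read(dev)
        for other in self.state:
            if other != dev:
                self.state[other] = State.INVALID
        self.state[dev] = State.MODIFIED
```

In this sketch a read on a device whose copy is already valid logs no transfer, which is the situation a software-managed cache is meant to exploit; a write invalidates every other device's copy so that a later read there triggers a write-back followed by a fresh copy.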


Pages: 859-872

Page count: 14
