A Programmable Parallel Accelerator for Learning and Classification

Cited by: 65
Authors
Cadambi, Srihari [1 ]
Majumdar, Abhinandan [1 ]
Becchi, Michela [1 ]
Chakradhar, Srimat [1 ]
Graf, Hans Peter [1 ]
Affiliation
[1] NEC Labs Amer Inc, Princeton, NJ 08540 USA
Source
PACT 2010: PROCEEDINGS OF THE NINETEENTH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES | 2010
Keywords
Accelerator-based systems; parallel computing; heterogeneous computing; machine learning; RECOGNITION;
DOI
10.1145/1854273.1854309
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology];
Discipline code
0812;
Abstract
For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min, and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing, where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
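The compute pattern the abstract describes, a large matrix-operation "map" stage whose intermediate results are immediately consumed by a small "reduce" stage (max/min, ranking, aggregation), can be sketched as follows. This is an illustrative NumPy sketch of the workload structure, not MAPLE code; the matrix sizes and the linear-classifier framing are hypothetical.

```python
# Hypothetical sketch of the map-then-reduce pattern MAPLE targets:
# a large intermediate matrix is produced by a matrix operation, then
# consumed on the fly by a per-row reduction (here, max), so the
# intermediate data need never be stored or sent off-chip.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))   # input samples (assumed sizes)
W = rng.standard_normal((64, 10))     # e.g. linear-classifier weights

scores = X @ W                        # "map": large intermediate data (1000 x 10)
labels = scores.argmax(axis=1)        # "reduce": per-row max -> predicted class

print(labels.shape)
```

On MAPLE, the reduction would be performed in the on-chip memory blocks as the PE grid streams out partial results, rather than materializing `scores` as done here.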
Pages: 273-283
Page count: 11