Partitioning multi-threaded processors with a large number of threads

Cited by: 5

Authors:

El-Moursy, A [1]

Garg, R [1]

Albonesi, DH [1]

Dwarkadas, S [1]

Affiliations:

[1] Univ Rochester, Dept Elect & Comp Engn, Rochester, NY 14627 USA

Source:

ISPASS 2005: IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE | 2005

DOI:

10.1109/ISPASS.2005.1430566

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Subject classification code:

0812

Abstract:

Today's general-purpose processors increasingly use multithreading to better leverage the additional on-chip real estate available with each technology generation. Simultaneous Multi-Threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Intel's Hyper-Threaded Pentium 4 processor partitions the queue structures between two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way Chip Multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four- and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, and the L1 Dcache banks in a Clustered Multi-Threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. For the front-end resources, on the other hand, significant sharing is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance, on average 90-96% of that of partitioned SMT, while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future.

Citation

Pages: 112-123

Page count: 12
