Fast Data Delivery for Many-Core Processors

Cited by: 18
Authors
Bakhshalipour, Mohammad [1,2]
Lotfi-Kamran, Pejman [2]
Mazloumi, Abbas [3,4]
Samandi, Farid [1,2]
Naderan-Tahan, Mahmood [5]
Modarressi, Mehdi [6]
Sarbazi-Azad, Hamid [2,7]
Affiliations
[1] SUT, Tehran, Iran
[2] Inst Res Fundamental Sci IPM, Sch Comp Sci, Tehran, Iran
[3] Univ Tehran, Tehran, Iran
[4] Univ Calif Riverside, Dept Comp Sci, Riverside, CA 92521 USA
[5] Shahid Chamran Univ Ahvaz SCU, Dept Comp Engn, Fac Engn, Ahvaz, Khuzestan, Iran
[6] Univ Tehran, Sch Elect & Comp Engn, Tehran, Iran
[7] SUT, Dept Comp Engn, Tehran, Iran
Funding
US National Science Foundation
Keywords
Memory system; network-on-chip; circuit switching; data prefetching; on-chip
DOI
10.1109/TC.2018.2821144
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Server workloads operate on large volumes of data. As a result, processors executing these workloads encounter frequent L1-D misses. In a many-core processor, an L1-D miss causes a request packet to be sent to an LLC slice and a response packet to be sent back to the L1-D, which incurs high overhead. While prior work targeted response packets, this work focuses on accelerating request packets. Unlike aggressive out-of-order (OoO) cores, the simpler cores used in many-core processors cannot hide the latency of L1-D request packets. We observe that the LLC slices that serve L1-D misses are strongly temporally correlated. Taking advantage of this observation, we design a simple and accurate predictor. Upon an L1-D miss, the predictor identifies the LLC slice that will serve the next L1-D miss, and a circuit is set up in advance so that, when the upcoming miss occurs, its request can use the already established circuit for fast transmission to that LLC slice. We show that our proposal outperforms data prefetching mechanisms in a many-core processor due to (1) higher prediction accuracy and (2) not wasting valuable off-chip bandwidth, while requiring significantly less overhead. Using full-system simulation, we show that our proposal accelerates the serving of data misses by 22 percent and improves performance by 10 percent over the state-of-the-art network-on-chip.
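The abstract does not spell out the predictor's internal organization. As a minimal sketch of the stated idea only, assuming a first-order (last-successor) prediction table, the C++ fragment below records which LLC slice served each L1-D miss and predicts that the slice that last followed the current slice will follow it again; a circuit toward the predicted slice could then be set up ahead of the next miss. SlicePredictor, kNumSlices, and the toy miss stream are hypothetical names and values, not taken from the paper.

```cpp
// Minimal sketch (hypothetical, not the paper's design): a first-order
// "last-successor" predictor exploiting temporal correlation between the
// LLC slices that serve consecutive L1-D misses.
#include <array>
#include <iostream>

constexpr int kNumSlices = 64;  // assumed slice count (one per tile)

class SlicePredictor {
 public:
  // Given the slice that served the current miss, predict the slice
  // that will serve the next one.
  int predict(int current_slice) const { return next_[current_slice]; }

  // Once the next miss resolves, record the observed transition so the
  // table tracks the workload's slice-to-slice correlation.
  void update(int prev_slice, int observed_slice) {
    next_[prev_slice] = observed_slice;
  }

 private:
  std::array<int, kNumSlices> next_{};  // last-seen successor per slice
};

int main() {
  SlicePredictor pred;
  // Toy stream of slices serving successive L1-D misses.
  const int stream[] = {3, 7, 3, 7, 3, 7};
  int prev = stream[0];
  for (int i = 1; i < 6; ++i) {
    int guess = pred.predict(prev);  // pre-establish a circuit toward 'guess'
    pred.update(prev, stream[i]);    // learn the actual transition
    std::cout << "predicted slice " << guess
              << ", actual slice " << stream[i] << "\n";
    prev = stream[i];
  }
  return 0;
}
```

A real design would additionally need a policy for mispredictions (e.g., falling back to ordinary packet switching) and a tuned table organization; both are omitted in this sketch.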
Pages: 1416-1429
Number of pages: 14