Sky-Sorter: A Processing-in-Memory Architecture for Large-Scale Sorting

Cited by: 4
Authors
Zokaee, Farzaneh [1]
Chen, Fan [1]
Sun, Guangyu [2]
Jiang, Lei [3]
Affiliations
[1] Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47405 USA
[2] Peking Univ, Ctr Energy Efficient Comp & Applicat CECA, Beijing 100871, Peoples R China
[3] Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN USA
Keywords
Sorting; Micromagnetics; Hardware; Corporate acquisitions; Bandwidth; Throughput; System-on-chip; Processing-in-memory; large-scale sorting; SKYRMION; LOGIC
DOI
10.1109/TC.2022.3169434
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Sorting is one of the most important algorithms in computer science. Conventional CPUs, GPUs, FPGAs, and ASICs running sorting are fundamentally bottlenecked by off-chip memory bandwidth because of their von Neumann architecture. Processing-near-memory (PNM) designs integrate a CPU, a GPU, or an ASIC on top of an HBM for sorting, but their sorting throughput is still limited by the HBM bandwidth and capacity. In this paper, we propose Sky-Sorter, a skyrmion racetrack memory (SRM)-based processing-in-memory (PIM) accelerator, to improve sorting performance on large-scale datasets. Sky-Sorter implements samplesort, which involves four steps: sampling, splitting marker sorting, partitioning, and bucket sorting. An SRM-based true random number generator (TRNG) is used in Sky-Sorter to randomly sample records from the dataset. Our proposed SRM-based partitioner divides the large dataset into many buckets based on the sampled splitting markers, and its partitioning throughput matches the off-chip memory bandwidth. We further design an SRM-based sorting unit (SU) to sort all records of a bucket without introducing extra CMOS logic. The SU uses the fast in-cell insertion characteristic of SRMs to perform insertion sort within a bucket. Sky-Sorter employs SUs to sort all buckets simultaneously by fully utilizing the large internal array bandwidth. Compared to state-of-the-art accelerators, Sky-Sorter improves the throughput per watt by approximately 4×.
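The four samplesort steps named in the abstract can be illustrated with a small software-only sketch. It is not the SRM hardware design: the function names, the `num_buckets`, `oversample`, and `seed` parameters, and the marker-selection rule are assumptions made for illustration, a seeded pseudo-random generator stands in for the TRNG, and buckets are sorted one after another here rather than simultaneously.

```python
import bisect
import random


def insertion_sort(records):
    """Sort a list in place by inserting each record into the sorted prefix."""
    for i in range(1, len(records)):
        key = records[i]
        j = i - 1
        # Shift larger records one slot to the right to open a gap for key.
        while j >= 0 and records[j] > key:
            records[j + 1] = records[j]
            j -= 1
        records[j + 1] = key
    return records


def samplesort(records, num_buckets=4, oversample=8, seed=0):
    """Software sketch of samplesort; all parameters are hypothetical."""
    if len(records) <= 1 or num_buckets <= 1:
        return insertion_sort(list(records))
    rng = random.Random(seed)  # assumption: PRNG stand-in for a hardware TRNG

    # Step 1: sampling. Draw a random subset of the records.
    sample_size = min(len(records), num_buckets * oversample)
    sample = rng.sample(list(records), sample_size)

    # Step 2: splitting marker sorting. Sort the sample and pick
    # num_buckets - 1 evenly spaced markers from it.
    sample.sort()
    step = len(sample) / num_buckets
    markers = [sample[int(i * step)] for i in range(1, num_buckets)]

    # Step 3: partitioning. Route each record to the bucket whose
    # marker range contains it.
    buckets = [[] for _ in range(num_buckets)]
    for record in records:
        buckets[bisect.bisect_right(markers, record)].append(record)

    # Step 4: bucket sorting. Sort each bucket with insertion sort and
    # concatenate the buckets in marker order.
    result = []
    for bucket in buckets:
        result.extend(insertion_sort(bucket))
    return result
```

Because every record in bucket k is no larger than any record in bucket k+1, concatenating the individually sorted buckets yields a fully sorted output, which is what allows the buckets to be sorted independently.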
Pages: 480-493
Number of pages: 14