17 items in total
- [2] Efficient NAS Parallel Benchmark Kernels with CUDA. 2020 28TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2020), 2020: 9 - 16
- [3] A Dataflow IR for Memory Efficient RIPL Compilation to FPGAs. ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2016 COLLOCATED WORKSHOPS, 2016, 10049: 174 - 188
- [5] CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-core Clusters. EURO-PAR 2012 PARALLEL PROCESSING, 2012, 7484: 415 - 426
- [6] Enabling the CUDA Unified Memory model in Edge, Cloud and HPC offloaded GPU kernels. 2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022: 834 - 841
- [7] MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs. LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2008, 5335: 16 - +
- [8] Efficient mapping of dimensionality reduction designs onto heterogeneous FPGAs. FCCM 2007: 15TH ANNUAL IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, PROCEEDINGS, 2007: 141 - +
- [9] From Capabilities to Regions: Enabling Efficient Compilation of Lexical Effect Handlers. PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2023, 7 (OOPSLA):
- [10] TSTC: Enabling Efficient Training via Structured Sparse Tensor Compilation. 29TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, ASP-DAC 2024, 2024: 884 - 889