High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems

被引:1
作者
Dong, Xiaojun [1 ]
Wu, Yunshu [1 ]
Wang, Zhongqi [2 ]
Dhulipala, Laxman [2 ]
Gu, Yan [1 ]
Sun, Yihan [1 ]
机构
[1] Univ Calif Riverside, Riverside, CA 92521 USA
[2] Univ Maryland, College Pk, MD 20742 USA
来源
PROCEEDINGS OF THE 35TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, SPAA 2023 | 2023年
关键词
Semisort; Collect-reduce; Histogram; Sorting; Group-by; Parallel Algorithms; Shared-Memory Parallelism; FRAMEWORK;
D O I
10.1145/3558481.3591071
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Semisort is a fundamental algorithmic primitive widely used in the design and analysis of efficient parallel algorithms. It takes input as an array of records and a function extracting a key per record, and reorders them so that records with equal keys are contiguous. Since many applications only require collecting equal values, but not fully sorting the input, semisort is broadly applicable, e.g., in string algorithms, graph analytics, and geometry processing, among many other domains. However, despite dozens of recent papers that use semisort in their theoretical analysis and the existence of an asymptotically optimal parallel semisort algorithm, most implementations of these parallel algorithms choose to implement semisort by using comparison or integer sorting in practice, due to potential performance issues in existing semisort implementations. In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. Our approach can easily be extended to two related problems, histogram and collect-reduce. Our algorithms achieve strong speedups in practice, and importantly, outperform state-of-the-art parallel sorting and semisorting methods for almost all settings we tested, with varying input sizes, distribution, and key types. On average (geometric means), our semisort implementation is at least 1.27x faster the best of the tested baselines. We also test two important applications with real-world data, and show that our algorithms improve the performance (up to 2.13x) over existing approaches. We believe that many other parallel algorithm implementations can be accelerated using our results.
引用
收藏
页码:341 / 353
页数:13
相关论文
共 78 条
[1]   Parallel Batch-Dynamic Graph Connectivity [J].
Acar, Umut A. ;
Anderson, Daniel ;
Blelloch, Guy E. ;
Dhulipala, Laxman .
SPAA'19: PROCEEDINGS OF THE 31ST ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURESS, 2019, 2019, :381-392
[2]   THE INPUT OUTPUT COMPLEXITY OF SORTING AND RELATED PROBLEMS [J].
AGGARWAL, A ;
VITTER, JS .
COMMUNICATIONS OF THE ACM, 1988, 31 (09) :1116-1127
[3]  
Agrawal Kunal, 2014, ACM S PAR ALG ARCH S
[4]  
Ahmad Zafar, 2021, ACM S PAR ALG ARCH S, P22, DOI 10.1145/3409964.3461802
[5]   Work-Efficient Batch-Incremental Minimum Spanning Trees with Applications to the Sliding-Window Model [J].
Anderson, Daniel ;
Blelloch, Guy E. ;
Tangwongsan, Kanat .
PROCEEDINGS OF THE 32ND ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES (SPAA '20), 2020, :51-61
[6]  
Arora N. S., 2001, Theory of Computing Systems (TOCS), V34
[7]  
Artiles Javier, 2008, Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
[8]   Engineering In-place (Shared-memory) Sorting Algorithms [J].
Axtmann, Michael ;
Witt, Sascha ;
Ferizovic, Daniel ;
Sanders, Peter .
ACM TRANSACTIONS ON PARALLEL COMPUTING, 2022, 9 (01)
[9]  
Axtmann Michael, 2017, EUR S ALG ESA
[10]  
Backstrom L., 2006, KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, P44