DISTRIBUTED RELAXED QUEUE IN REMOTE MEMORY ACCESS MODEL

Cited by: 0
Authors
Paznikov, A. A. [1]
Affiliations
[1] St Petersburg Electrotech Univ LETI, Dept Comp Sci & Engn, St Petersburg, Russia
Source
VESTNIK TOMSKOGO GOSUDARSTVENNOGO UNIVERSITETA-UPRAVLENIE VYCHISLITELNAJA TEHNIKA I INFORMATIKA-TOMSK STATE UNIVERSITY JOURNAL OF CONTROL AND COMPUTER SCIENCE | 2020, Issue 50
Keywords
distributed queue; relaxed data structures; remote memory access; MPI; RMA;
DOI
10.17223/19988605/50/12
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
The remote memory access (RMA) technique is an attractive way to improve the efficiency of high-performance computing (HPC) and to simplify parallel program development. Unlike message passing, in RMA programs processes communicate by directly accessing each other's memory. The RMA model is implemented in the MPI standard and offers a Partitioned Global Address Space (PGAS). Many applications have been shown to benefit from RMA communications, where a process directly accesses the memory of another process instead of sending and receiving messages. To the best of our knowledge, there are no efficient scalable concurrent data structures designed in the RMA model.

In modern computer systems, multiple processes (threads) execute concurrently and synchronize their activities through shared (concurrent) data structures. Such data structures are therefore key building blocks in parallel programming, and their efficiency is crucial for overall performance. Concurrent data structures are, however, much more difficult to design than their sequential counterparts, as concurrently executing processes may interleave their shared-memory steps in very complicated ways, often with unexpected outcomes. Designing efficient concurrent data structures for distributed environments with a deep hierarchy, such as computer clusters and data centers, is challenging. A lot of prior work has focused on efficient synchronization techniques for shared-memory systems. However, in such systems the shared memory itself may become an impediment to scalability. Moreover, shared-memory systems are not sufficient for processing the large data volumes of current applications. Hence, there is growing demand for efficient concurrent data structures in hierarchical distributed systems (supercomputers, clusters, grids). Regardless of the design, many data structures are subject to inherent sequential bottlenecks for some operations, such as the delete-min operation for priority queues or the insert and remove operations for queues and stacks. A promising way to alleviate this bottleneck problem is to relax the consistency requirements for such operations. There is evidence that, on most workloads, relaxed data structures outperform data structures with strict semantics while ensuring acceptable degrees of operation reordering. However, to the best of our knowledge, relaxed concurrent data structures have not been studied in distributed environments.

As the data structure to study, we consider relaxed queues. Relaxed queues do not guarantee strict FIFO order: a remove operation might not remove exactly the first inserted element, but an element close to it. We propose an approach based on multiple sequential data structures distributed among the processes. This approach has proved effective in shared-memory systems, where it outperforms data structures with strict ordering. Every process can asynchronously access remote segments via RMA calls. Thanks to the low latency of one-sided communications and hardware support for RDMA, this scheme ensures high performance. When a process executes an insert operation on the relaxed queue, it sets a timestamp, picks a remote process (randomly or by a specified algorithm), and inserts the element together with the timestamp into that process's data structure. When a process executes a remove operation, it selects a subset of other processes and remotely fetches "candidate" elements from them. Finally, it chooses the "best" candidate, the one with the minimal timestamp, and returns it.
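To make the scheme concrete, a minimal C sketch of the insert and remove paths using standard MPI-3 RMA calls is given below. It is not the implementation evaluated in this paper: the per-process segment layout (a circular buffer with head and tail counters), the identifiers (rqueue_t, rq_insert, rq_remove, SEG_CAP, NCAND) and the shared-lock probe / exclusive-lock pop protocol are illustrative assumptions, and segment overflow handling is omitted.

/* A minimal sketch (not the paper's implementation) of a relaxed distributed
 * queue in the MPI-3 RMA model.  Each process hosts a segment
 * [head | tail | SEG_CAP elements] in an RMA window; rq_insert appends a
 * (timestamp, value) pair to a random segment, rq_remove probes NCAND random
 * segments and pops the candidate with the smallest timestamp. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define SEG_CAP 1024        /* elements per process-local segment            */
#define NCAND   2           /* candidate segments probed by one remove       */

typedef struct { double ts; long val; } elem_t;
typedef struct { MPI_Win win; char *base; int nprocs; } rqueue_t;

static void rq_create(rqueue_t *q, MPI_Comm comm) {
    int me;  MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &q->nprocs);
    MPI_Aint bytes = 2 * sizeof(long) + SEG_CAP * sizeof(elem_t);
    MPI_Win_allocate(bytes, 1, MPI_INFO_NULL, comm, &q->base, &q->win);
    long zero[2] = { 0, 0 };                   /* head = tail = 0            */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, me, 0, q->win);
    MPI_Put(zero, 2, MPI_LONG, me, 0, 2, MPI_LONG, q->win);
    MPI_Win_unlock(me, q->win);
    MPI_Barrier(comm);                         /* all segments initialised   */
}

/* Insert: append to a randomly chosen remote segment under an exclusive lock. */
static void rq_insert(rqueue_t *q, long value) {
    int target = rand() % q->nprocs;
    elem_t e = { MPI_Wtime(), value };         /* timestamp the element      */
    long tail;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, q->win);
    MPI_Get(&tail, 1, MPI_LONG, target, sizeof(long), 1, MPI_LONG, q->win);
    MPI_Win_flush(target, q->win);             /* need tail before the puts  */
    MPI_Aint disp = 2 * sizeof(long) + (tail % SEG_CAP) * sizeof(elem_t);
    MPI_Put(&e, sizeof(elem_t), MPI_BYTE, target, disp,
            sizeof(elem_t), MPI_BYTE, q->win);
    tail++;
    MPI_Put(&tail, 1, MPI_LONG, target, sizeof(long), 1, MPI_LONG, q->win);
    MPI_Win_unlock(target, q->win);
}

/* Remove: probe NCAND random segments under shared locks, then re-check and
 * pop the oldest candidate under an exclusive lock; retry if it was taken.  */
static int rq_remove(rqueue_t *q, long *out) {
    for (;;) {
        int best_rank = -1;  long best_head = 0;  elem_t best = { 0.0, 0 };
        for (int i = 0; i < NCAND; i++) {
            int t = rand() % q->nprocs;
            long ht[2];  elem_t e;
            MPI_Win_lock(MPI_LOCK_SHARED, t, 0, q->win);
            MPI_Get(ht, 2, MPI_LONG, t, 0, 2, MPI_LONG, q->win);
            MPI_Win_flush(t, q->win);
            if (ht[0] < ht[1]) {               /* segment is not empty       */
                MPI_Aint d = 2 * sizeof(long) + (ht[0] % SEG_CAP) * sizeof(elem_t);
                MPI_Get(&e, sizeof(elem_t), MPI_BYTE, t, d,
                        sizeof(elem_t), MPI_BYTE, q->win);
                MPI_Win_flush(t, q->win);
                if (best_rank < 0 || e.ts < best.ts) {
                    best_rank = t;  best_head = ht[0];  best = e;
                }
            }
            MPI_Win_unlock(t, q->win);
        }
        if (best_rank < 0) return -1;          /* all probed segments empty  */
        long head;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, best_rank, 0, q->win);
        MPI_Get(&head, 1, MPI_LONG, best_rank, 0, 1, MPI_LONG, q->win);
        MPI_Win_flush(best_rank, q->win);
        if (head == best_head) {               /* candidate is still there   */
            head++;
            MPI_Put(&head, 1, MPI_LONG, best_rank, 0, 1, MPI_LONG, q->win);
            MPI_Win_unlock(best_rank, q->win);
            *out = best.val;
            return 0;
        }
        MPI_Win_unlock(best_rank, q->win);     /* lost the race: probe again */
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    rqueue_t q;  rq_create(&q, MPI_COMM_WORLD);
    int rank;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank + 1);
    for (long i = 0; i < 8; i++)
        rq_insert(&q, 100 * rank + i);         /* spread over random segments */
    MPI_Barrier(MPI_COMM_WORLD);
    long v;
    while (rq_remove(&q, &v) == 0)
        printf("rank %d removed %ld\n", rank, v);
    MPI_Win_free(&q.win);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpiexec, each rank inserts a few timestamped elements into random segments and then drains whatever its relaxed removes find; because only NCAND segments are probed per remove, elements may be returned out of strict FIFO order, which is exactly the relaxation described above.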
We evaluated the developed relaxed queue on a computer cluster. In the experiments we compared the developed distributed queue with the linked list implemented in the MPICH library (MPICH linked list) in the RMA model. The throughput of the developed relaxed distributed queue substantially exceeds that of the MPICH linked list. The improvement was achieved by reducing communication overheads when accessing the elements of the structures. The main drawback of common distributed lists is that the head pointers become sequential bottlenecks. In contrast, the developed distributed queue has multiple access points distributed among the processes and no such bottlenecks. We also investigate the influence of the number of candidate elements and of the chosen synchronization type on the efficiency of the queues. As expected, throughput decreases as the number of candidates increases. This is explained by the additional overheads for locking and for getting the states and values of candidate queues. We also found that, for the relaxed queue, the exclusive synchronization mode provides better throughput than the shared mode. Thus, the proposed decentralized asynchronous approach to designing relaxed distributed data structures, as expected, eliminates bottlenecks, minimizes operation latency and provides high scalability of parallel programs.
Pages: 97-105
Number of pages: 9