Optimization Strategies for Inter-Thread Synchronization Overhead on NUMA Machine

Cited by: 0
Authors
Wu, Song [1]
Zhang, Jun [1]
Peng, Yaqiong [1]
Jin, Hai [1]
Jiang, Wenbin [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Cluster & Grid Comp Lab, Serv Comp Technol & Syst Lab, Sch Comp Sci & Technol, Wuhan 430074, Peoples R China
Source
2015 IEEE 34TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC) | 2015
Keywords
NUMA; data consistence; synchronization; algorithm;
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The overhead caused by data consistency issues in inter-thread synchronization can degrade the performance of parallel applications. Non-Uniform Memory Access (NUMA), the mainstream architecture of today's multicore processors, exacerbates this problem because Remote Memory References (RMRs) incur significant overhead. Reducing synchronization overhead therefore requires addressing the data consistency issue. In this paper, we classify the overhead into two kinds: (1) overhead incurred by the algorithms themselves, and (2) overhead incurred by critical sections. To reduce these two kinds of overhead on NUMA machines, we present two optimization strategies, search and backtrace (SAB) and reorder critical section and non-critical section (RCAN), respectively. In SAB, a server thread tries to find a thread from the master NUMA node and designates it as the new server thread. As a result, shared data resides in the cache of the master NUMA node most of the time, which lowers the overhead caused by the data consistency issue in critical sections. In RCAN, each thread posts its synchronization requests consecutively and then executes its non-critical sections consecutively. As a result, server threads can serve enough requests to achieve better data locality. We design two algorithms: R-Synch, based on SAB, and H-STA, based on RCAN. Our evaluation against representative synchronization algorithms demonstrates the effectiveness of R-Synch and H-STA.
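The abstract gives only a high-level description of SAB, so the following is a toy illustration of the server-selection idea and not the R-Synch algorithm itself. It assumes a simple list of waiting threads, each tagged with a NUMA node, and compares plain FIFO handoff of the server role with a policy that prefers waiters on the master node. The names `Waiter`, `pick_next_server`, `handoff_order`, and `count_node_changes` are hypothetical, and the fallback to the first waiter is an assumption; the "backtrace" step is not specified in the abstract and is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Waiter:
    """A waiting thread, tagged with the NUMA node it runs on (hypothetical model)."""
    thread_id: int
    numa_node: int


def pick_next_server(waiters: List[Waiter], master_node: int) -> Optional[Waiter]:
    """Search the waiters for one on the master node; otherwise fall back to the first.

    The fallback is an assumption made for this sketch.
    """
    for w in waiters:
        if w.numa_node == master_node:
            return w
    return waiters[0] if waiters else None


def handoff_order(waiters: List[Waiter], master_node: int, prefer_master: bool) -> List[Waiter]:
    """Return the order in which waiters take over the server role."""
    pending = list(waiters)
    order: List[Waiter] = []
    while pending:
        nxt = pick_next_server(pending, master_node) if prefer_master else pending[0]
        pending.remove(nxt)
        order.append(nxt)
    return order


def count_node_changes(order: List[Waiter]) -> int:
    """Count handoffs that move the server role (and the shared data) across nodes."""
    return sum(1 for a, b in zip(order, order[1:]) if a.numa_node != b.numa_node)


# Four waiters alternating between node 1 and node 0; node 0 is the master node.
waiters = [Waiter(0, 1), Waiter(1, 0), Waiter(2, 1), Waiter(3, 0)]

fifo = handoff_order(waiters, master_node=0, prefer_master=False)
sab_like = handoff_order(waiters, master_node=0, prefer_master=True)

print(count_node_changes(fifo))      # 3 cross-node handoffs
print(count_node_changes(sab_like))  # 1 cross-node handoff
```

In this toy model, preferring master-node waiters groups the node-0 threads together, so the server role crosses nodes once instead of three times. Real remote-memory costs depend on the cache-coherence protocol and the workload, which this sketch does not capture.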
Pages: 8
Related papers
16 in total
  • [1] Anderson T. E., 1990, IEEE Transactions on Parallel and Distributed Systems, V1, P6, DOI 10.1109/71.80120
  • [2] Berger ED, 2000, ACM SIGPLAN NOTICES, V35, P117, DOI 10.1145/384264.379232
  • [3] Craig T., 1993, BUILDING FIFO PRIORI
  • [4] Dice D, 2011, SPAA 11: PROCEEDINGS OF THE TWENTY-THIRD ANNUAL SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, P65
  • [5] Fatourou P, Kallimanis ND, 2012, Revisiting the Combining Synchronization Technique, ACM SIGPLAN NOTICES, V47(8), P257-266
  • [6] Fatourou P, 2011, SPAA 11: PROCEEDINGS OF THE TWENTY-THIRD ANNUAL SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, P325
  • [7] Hackenberg Daniel, 2009, Proceedings of the 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2009), P413, DOI 10.1145/1669112.1669165
  • [8] Hendler D, 2010, SPAA '10: PROCEEDINGS OF THE TWENTY-SECOND ANNUAL SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, P355
  • [9] Luchangco V, 2006, LECT NOTES COMPUT SC, V4128, P801
  • [10] Magnusson P., 1994, Proceedings Eighth International Parallel Processing Symposium (Cat. No.94TH0652-8), P165, DOI 10.1109/IPPS.1994.288305