Performance Optimization of Lustre File System Based on Reinforcement Learning

Cited by: 0
Authors
Zhang W. [1,2]
Wang L. [1]
Cheng Y. [1]
Affiliations
[1] Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing
[2] University of Chinese Academy of Sciences, Beijing
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2019, Vol. 56, No. 7
Funding
National Natural Science Foundation of China
Keywords
Deep learning; Distributed storage; Parameter adjustment; Performance tuning; Reinforcement learning;
DOI
10.7544/issn1000-1239.2019.20180797
Abstract
Computing in high energy physics is a typical data-intensive application. The throughput and response time of a distributed storage system are its key performance indicators, and they are therefore the usual targets of performance optimization. A distributed storage system exposes a large number of tunable parameters, and their settings have a strong influence on system performance. At present, these parameters are either set to static values or tuned automatically by heuristic rules written by experienced administrators. Neither approach is satisfactory: data access patterns and hardware capabilities are diverse, and it is difficult to derive heuristic rules for hundreds of interacting parameters from human experience alone. In fact, if the tuning engine is regarded as an agent and the storage system as its environment, the parameter adjustment problem becomes a typical sequential decision problem. Therefore, based on the data access characteristics of high energy physics computing, we propose an automated parameter tuning method using reinforcement learning. Experiments show that, in the same test environment and with the default parameters of the Lustre file system as the baseline, this method increases throughput by about 30%. © 2019, Science Press. All rights reserved.
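The abstract's framing of parameter tuning as a sequential decision problem can be illustrated with a small sketch. The code below is an assumption-laden toy, not the paper's implementation: it tunes a single hypothetical Lustre knob (the client read-ahead window) with tabular Q-learning against a synthetic throughput model, whereas the paper applies deep reinforcement learning to many interacting parameters on a real system. The parameter values, reward model, and hyperparameters are all illustrative.

```python
# Minimal sketch: storage-parameter tuning as a sequential decision problem.
# NOT the paper's method; a tabular Q-learning stand-in for its deep RL agent.
import random
from collections import defaultdict

# Hypothetical tunable, discretized to keep the example tabular:
# candidate values for the client-side read-ahead window, in MB.
READAHEAD_MB = [0, 4, 16, 64, 256]

def measure_throughput(setting_idx):
    """Stand-in for running an I/O benchmark against the real file system.
    Here a noisy synthetic curve peaks at a 64 MB read-ahead window."""
    true_score = [10.0, 40.0, 70.0, 100.0, 80.0][setting_idx]
    return true_score + random.gauss(0, 5.0)

# State: index of the current setting. Action: move down (-1), stay, or up (+1).
ACTIONS = [-1, 0, 1]
q_table = defaultdict(float)          # (state, action) -> estimated return
alpha, gamma, epsilon = 0.1, 0.9, 0.2

state = 0                             # start from the smallest window
for step in range(2000):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q_table[(state, a)])
    next_state = min(max(state + action, 0), len(READAHEAD_MB) - 1)

    # Reward is the throughput observed after applying the new setting.
    reward = measure_throughput(next_state)

    # Standard one-step Q-learning update.
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += alpha * (
        reward + gamma * best_next - q_table[(state, action)])
    state = next_state

best = max(range(len(READAHEAD_MB)),
           key=lambda s: max(q_table[(s, a)] for a in ACTIONS))
print(f"Learned preferred read-ahead window: {READAHEAD_MB[best]} MB")
```

In the setting the abstract describes, measure_throughput would run a benchmark against the live Lustre deployment, and the state would encode the current values of many interacting parameters together with observed workload features, which is what motivates a deep RL agent rather than a lookup table.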
Pages: 1578-1586 (8 pages)