Storage optimization for large-scale distributed stream-processing systems

Cited by: 1
Authors
Hildrum, Kirsten [1, 4]
Douglis, Fred [1, 4]
Wolf, Joel L. [1, 4]
Yu, Philip S. [1, 4]
Fleischer, Lisa [2, 5]
Katta, Akshay [3, 6]
Affiliations
[1] IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
[2] Dartmouth College, Box 6211, Hanover
[3] Amazon Corporation, Seattle, WA 98101
Keywords
File assignment problem; Load balancing; Optimization; Storage management; Streaming systems; Theory
DOI
10.1145/1326542.1326547
Abstract
We consider storage in an extremely large-scale distributed computer system designed for stream processing applications. In such systems, both incoming data and intermediate results may need to be stored to enable analyses at unknown future times. The quantity of data of potential use would dominate even the largest storage system. Thus, a mechanism is needed to keep the data most likely to be used. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time in a prespecified way [Douglis et al. 2004]. Storage space for data entering the system is reclaimed automatically by deleting data of the lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming stream of data presents a challenge. In this article, we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value, while simultaneously balancing the read load to the file system. The key aspects of such a scheme are quite different from those that arise in traditional file assignment problems. We further motivate this optimization problem and describe a solution, comparing its performance to other reasonable schemes via simulation experiments. © 2008 ACM.
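The reclamation idea in the abstract (each object carries a prespecified, time-varying value, and space is freed by deleting the objects of lowest current value) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the scheme from the article: the names `StoredObject`, `linear_decay`, and `reclaim` are hypothetical, the linear decay shape is an arbitrary choice, and the sketch covers a single store only, leaving out the article's placement across file systems and read-load balancing.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StoredObject:
    """A stored data object with a prespecified retention value function (hypothetical type)."""
    name: str
    size: int                            # size in arbitrary units (assumption)
    created: float                       # creation time
    retention: Callable[[float], float]  # maps the object's age to its current value

    def value_at(self, now: float) -> float:
        # The value depends only on the object's age, so it is fixed in advance.
        return self.retention(now - self.created)


def linear_decay(initial: float, lifetime: float) -> Callable[[float], float]:
    """One possible retention shape (assumed): value falls linearly to zero over `lifetime`."""
    def value(age: float) -> float:
        return max(0.0, initial * (1.0 - age / lifetime))
    return value


def reclaim(objects: List[StoredObject], needed: int, now: float
            ) -> Tuple[List[StoredObject], List[StoredObject]]:
    """Delete objects of lowest current value until at least `needed` units are freed.

    Returns (kept, deleted).
    """
    kept: List[StoredObject] = []
    deleted: List[StoredObject] = []
    freed = 0
    # Visit objects from lowest to highest current value.
    for obj in sorted(objects, key=lambda o: o.value_at(now)):
        if freed < needed:
            deleted.append(obj)
            freed += obj.size
        else:
            kept.append(obj)
    return kept, deleted
```

Calling `reclaim(objects, needed, now)` with the space required by newly arriving data returns the surviving objects and the ones evicted to make room. Because values change over time, the same set of objects can yield a different eviction order at a different `now`.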
Related papers
34 in total
[1]  
ABADI D.J., AHMAD Y., BALAZINSKA M., CETINTEMEL U., CHERNIACK M., HWANG J.-H., LINDNER W., MASKEY A.S., RASIN A., RYVKINA E., TATBUL N., XING Y., ZDONIK S., The design of the Borealis stream processing engine, Proceedings of the 2nd Biennial Conference on Innovative Data Systems Research (CIDR), (2005)
[2]  
AHUJA R., MAGNANTI T., ORLIN J., Network Flows, (1993)
[3]  
ALVAREZ G.A., BOROWSKY E., GO S., ROMER T.H., BECKER-SZENDY R., GOLDING R., MERCHANT A., SPASOJEVIC M., VEITCH A., WILKES J., Minerva: An automated resource provisioning tool for large-scale storage systems, ACM Trans. Comput. Syst., 19, 4, pp. 483-518, (2001)
[4]  
AMINI L., JAIN N., SEHGAL A., SILBER J., VERSCHEURE O., Adaptive control of extreme-scale stream processing systems, Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS), (2006)
[5]  
BENT J., THAIN D., ARPACI-DUSSEAU A.C., ARPACI-DUSSEAU R.H., LIVNY M., Explicit control in the batch-aware distributed file system, Proceedings of the ACM/USENIX Symposium on Networked System Design and Implementation (NSDI), pp. 365-378, (2004)
[6]  
BERTSIMAS D., TSITSIKLIS J., Introduction to Linear Optimization, (1997)
[7]  
BHAGWAN R., DOUGLIS F., HILDRUM K., KEPHART J.O., WALSH W.E., Time-varying management of data storage, 1st Workshop on Hot Topics in System Dependability, (2005)
[8]  
BRANSON M., DOUGLIS F., FAWCETT B., LIU Z., RIABOV A., YE F., Autonomic operations in cooperative stream processing systems, Proceedings of the 2nd Workshop on Hot Topics in Autonomic Computing, (2007)
[9]  
CHANDRASEKARAN S., COOPER O., DESHPANDE A., FRANKLIN M.J., HELLERSTEIN J.M., HONG W., KRISHNAMURTHY S., MADDEN S.R., REISS F., SHAH M.A., TelegraphCQ: Continuous dataflow processing, Proceedings of the ACM SIGMOD International Conference on Management of Data, p. 668, (2003)
[10]  
DABEK F., KAASHOEK M.F., KARGER D., MORRIS R., STOICA I., Wide-area cooperative storage with CFS, Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), (2001)