Storage optimization for large-scale distributed stream-processing systems

Cited by: 1
Authors
Hildrum, Kirsten [1, 4]
Douglis, Fred [1, 4]
Wolf, Joel L. [1, 4]
Yu, Philip S. [1, 4]
Fleischer, Lisa [2, 5]
Katta, Akshay [3, 6]
Affiliations
[1] IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
[2] Dartmouth College, Box 6211, Hanover
[3] Amazon Corporation, Seattle, WA 98101
Keywords
File assignment problem; Load balancing; Optimization; Storage management; Streaming systems; Theory
DOI
10.1145/1326542.1326547
Abstract
We consider storage in an extremely large-scale distributed computer system designed for stream-processing applications. In such systems, both incoming data and intermediate results may need to be stored to enable analyses at unknown future times. The quantity of data of potential use would dominate even the largest storage system. Thus, a mechanism is needed to keep the data most likely to be used. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time in a prespecified way [Douglis et al. 2004]. Storage space for data entering the system is reclaimed automatically by deleting data of the lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming stream of data presents a challenge. In this article, we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value, while simultaneously balancing the read load to the file system. The key aspects of such a scheme are quite different from those that arise in traditional file assignment problems. We further motivate this optimization problem and describe a solution, comparing its performance to other reasonable schemes via simulation experiments. © 2008 ACM.