Data Replication in Data Intensive Scientific Applications with Performance Guarantee

Cited by: 49
Authors
Nukarapu, Dharma Teja [1]
Tang, Bin [1]
Wang, Liqiang [2]
Lu, Shiyong [3]
Affiliations
[1] Wichita State Univ, Dept Elect Engn & Comp Sci, Wichita, KS 67260 USA
[2] Univ Wyoming, Dept Comp Sci, Laramie, WY 82071 USA
[3] Wayne State Univ, Dept Comp Sci, Detroit, MI 48202 USA
Funding
US National Science Foundation (NSF)
Keywords
Data intensive applications; Data Grids; data replication; algorithm design and analysis; simulations; ALGORITHMS; STRATEGY
DOI
10.1109/TPDS.2010.207
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data intensive applications, has proven to be NP-hard and even non approximable, making this problem difficult to solve. Meanwhile, most of the previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee, but also can be implemented in a distributed and practical manner. Specifically, we design a polynomial time centralized replication algorithm that reduces the total data file access delay by at least half of that reduced by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and it is more scalable and adaptive to the dynamic change of file access patterns in Data Grids.
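
The abstract states the objective (reducing total data file access delay under limited storage) but does not describe the steps of the centralized algorithm. As an illustration of the problem only, the sketch below shows one generic greedy placement heuristic: it repeatedly adds the single replica that reduces total access delay the most. It is not taken from the paper, and no performance guarantee is claimed for it. All names (`total_delay`, `greedy_replicate`, `dist`, `demand`, `origin`, `capacity`) are hypothetical, and unit-size files and a precomputed distance matrix are assumed.

```python
from itertools import product


def total_delay(dist, demand, holders):
    """Sum over (node, file) of request rate times distance to the nearest copy."""
    return sum(
        rate * min(dist[node][h] for h in holders[f])
        for (node, f), rate in demand.items()
    )


def greedy_replicate(dist, demand, origin, capacity):
    """Place replicas one at a time, always taking the largest delay reduction.

    dist[i][j] : cost between nodes i and j (assumed precomputed)
    demand     : {(node, file): request rate}
    origin     : {file: node that permanently stores the file}
    capacity   : {node: number of unit-size replicas it can hold}
    """
    holders = {f: {src} for f, src in origin.items()}
    free = dict(capacity)
    current = total_delay(dist, demand, holders)
    while True:
        best_gain, best_move = 0, None
        for node, f in product(free, origin):
            if free[node] == 0 or node in holders[f]:
                continue
            # Tentatively place the replica, measure the gain, then undo.
            holders[f].add(node)
            gain = current - total_delay(dist, demand, holders)
            holders[f].remove(node)
            if gain > best_gain:
                best_gain, best_move = gain, (node, f)
        if best_move is None:  # no placement reduces delay any further
            return holders, current
        node, f = best_move
        holders[f].add(node)
        free[node] -= 1
        current -= best_gain


# Hypothetical example: four nodes on a line, two files, two nodes with one free slot each.
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
demand = {(3, "a"): 5, (2, "a"): 1, (0, "b"): 2}
origin = {"a": 0, "b": 3}
capacity = {1: 1, 2: 1}
holders, delay = greedy_replicate(dist, demand, origin, capacity)
```

In this toy instance the delay before any replication is 23. Each round re-evaluates every feasible placement, so the sketch favors clarity over speed.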
Pages: 1299-1306
Page count: 8