Distributed Data Provenance for Large-Scale Data-Intensive Computing

Cited by: 0
Authors
Zhao, Dongfang [1]
Shou, Chen [1]
Malik, Tanu [2, 3]
Raicu, Ioan [1, 2]
Affiliations
[1] IIT, Dept Comp Sci, Chicago, IL 60616 USA
[2] Univ Chicago, Computat Sci, Chicago, IL 60637 USA
[3] Argonne Natl Lab, Math & Comp Sci Div, Argonne, IL 60439 USA
Source
2013 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2013
Funding
National Science Foundation (USA);
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overhead, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate whether provenance metadata should be loosely coupled or tightly integrated with a file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, each focusing on a single kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and provides a distributed module for querying provenance. Our results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and has promise to scale to petascale and beyond. Furthermore, FusionFS with its own storage layer for provenance capture is able to scale up to 1K nodes on a BlueGene/P supercomputer.
Pages: 8
Related papers
50 in total
  • [1] MilliSort and MilliQuery: Large-Scale Data-Intensive Computing in Milliseconds
    Li, Yilong
    Park, Seo Jin
    Ousterhout, John
    PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON NETWORKED SYSTEM DESIGN AND IMPLEMENTATION, 2021, : 593 - 612
  • [2] Software architecture for large-scale, distributed, data-intensive systems
    Mattmann, CA
    Crichton, DJ
    Hughes, JS
    Kelly, SC
    Ramirez, PM
    FOURTH WORKING IEEE/IFIP CONFERENCE ON SOFTWARE ARCHITECTURE (WICSA 2004), PROCEEDINGS, 2004, : 255 - 264
  • [3] GridBatch: Cloud Computing for Large-Scale Data-Intensive Batch Applications
    Liu, Huan
    Orban, Dan
    CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS, 2008, : 295 - 305
  • [4] Passive Network Performance Estimation for Large-Scale, Data-Intensive Computing
    Kim, Jinoh
    Chandra, Abhishek
    Weissman, Jon B.
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2011, 22 (08) : 1365 - 1373
  • [5] FRAMEWORK FOR DATA-INTENSIVE APPLICATIONS OPTIMIZATION IN LARGE-SCALE DISTRIBUTED SYSTEMS
    Cirstoiu, Catalin
    Tapus, Nicolae
    UNIVERSITY POLITEHNICA OF BUCHAREST SCIENTIFIC BULLETIN SERIES C-ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, 2009, 71 (03) : 89 - 104
  • [6] Study of performance evaluation for data-intensive large-scale systems
    Liu, Ying
    Song, Huaiming
    Jiao, Limei
    AMS 2007: FIRST ASIA INTERNATIONAL CONFERENCE ON MODELLING & SIMULATION ASIA MODELLING SYMPOSIUM, PROCEEDINGS, 2007, : 270 - +
  • [7] Data-Intensive Computing Modules for Teaching Parallel and Distributed Computing
    Gowanlock, Michael
    Gallet, Benoit
    2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2021, : 350 - 357
  • [8] Distributed Data Access/Find System with Metadata for Data-Intensive Computing
    Ikebe, Minoru
    Inomata, Atsuo
    Fujikawa, Kazutoshi
    Sunahara, Hideki
    2008 9TH IEEE/ACM INTERNATIONAL CONFERENCE ON GRID COMPUTING, 2008, : 361 - 366
  • [9] Nebula: Distributed Edge Cloud for Data-Intensive Computing
    Ryden, Mathew
    Oh, Kwangsung
    Chandra, Abhishek
    Weissman, Jon
    PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON COLLABORATION TECHNOLOGIES AND SYSTEMS (CTS), 2014, : 491 - 492
  • [10] Data Provenance in Large-Scale Distribution
    Zhu, Yunan
    Che, Wei
    Shan, Chao
    Zhao, Shen
    ARTIFICIAL INTELLIGENCE AND SECURITY, ICAIS 2022, PT III, 2022, 13340 : 28 - 42