Dredging a Data Lake: Decentralized Metadata Extraction

Cited by: 5
Authors
Skluzacek, Tyler J. [1]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
Source
MIDDLEWARE'19: PROCEEDINGS OF THE 2019 20TH INTERNATIONAL MIDDLEWARE CONFERENCE DOCTORAL SYMPOSIUM | 2019
Keywords
data lakes; serverless; metadata extraction; file systems
DOI
10.1145/3366624.3368170
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
The rapid generation of data from distributed IoT devices, scientific instruments, and compute clusters presents unique data management challenges. The influx of large, heterogeneous, and complex data causes repositories to become siloed or generally unsearchable, two problems that distributed file systems do not currently address well. In this work, we propose Xtract, a serverless middleware that extracts metadata from files spread across heterogeneous edge computing resources. In future work, we intend to study how Xtract can automatically construct file extraction workflows subject to users' cost, time, security, and compute allocation constraints. To this end, Xtract will enable the creation of a searchable centralized index across distributed data collections.
Pages: 51-53
Page count: 3