Understanding Data Similarity in Large-Scale Scientific Datasets

Cited by: 0

Authors:

Linton, Payton [1]

Melodia, William [1]

Lazar, Alina [1]

Agarwal, Deborah [2]

Bianchi, Ludovico [2]

Ghoshal, Devarshi [2]

Pastorello, Gilbert [2]

Ramakrishnan, Lavanya [2]

Wu, Kesheng [2]

Affiliations:

[1] Youngstown State Univ, Youngstown, OH 44555 USA

[2] Lawrence Berkeley Natl Lab, Berkeley, CA USA

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2019

Keywords:

dimensionality reduction; clustering; similarity measure; TIME; FLUXES;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Today, scientific experiments and simulations produce massive amounts of heterogeneous data that need to be stored and analyzed. Given that these large datasets are stored in many files, formats and locations, how can scientists find relevant data, duplicates or similarities? In this context, we concentrate on developing algorithms to compare similarity of time series for the purpose of search, classification and clustering. For example, generating accurate patterns from climate-related time series is important not only for building models for weather forecasting and climate prediction, but also for modeling and predicting the cycle of carbon, water, and energy. We developed the methodology and ran an exploratory analysis of climatic and ecosystem variables from the FLUXNET2015 dataset. The proposed combination of similarity metrics, nonlinear dimension reduction, clustering methods and validity measures for time series data has never been applied to unlabeled datasets before, and provides a process that can be easily extended to other scientific time series data. The dimensionality reduction step provides a good way to identify the optimum number of clusters, detect outliers and assign initial labels to the time series data. We evaluated multiple similarity metrics in terms of the internal cluster validity for driver as well as response variables. While the best metric often depends on a number of factors, the Euclidean distance seems to perform well for most variables and also in terms of computational expense.
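
As a rough illustration of two of the ingredients named in the abstract, a similarity metric and an internal cluster validity measure, the following Python sketch computes pairwise Euclidean distances between equal-length time series and scores a candidate labeling with the mean silhouette coefficient. This is not the authors' implementation: the abstract does not say which validity measures, clustering methods or dimension reduction technique were used, so the silhouette coefficient is an assumed stand-in, the function names (`euclidean`, `distance_matrix`, `silhouette`) are hypothetical, the toy data is invented for demonstration, and the dimension reduction and clustering steps are left out.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length time series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = euclidean(series[i], series[j])
    return dist


def silhouette(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient of a labeling (assumed validity measure)."""
    n = len(labels)
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            continue  # a singleton cluster contributes a score of 0
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / n


# Toy data (invented): two tight groups of short series.
toy_series = [[0, 0, 0], [0, 1, 0], [10, 10, 10], [10, 11, 10]]
toy_dist = distance_matrix(toy_series)
good_score = silhouette(toy_dist, [0, 0, 1, 1])  # groups nearby series together
bad_score = silhouette(toy_dist, [0, 1, 0, 1])   # mixes the two groups
```

In this toy case the labeling that groups nearby series scores higher than the one that mixes them, which is the kind of comparison an internal validity measure enables when no ground-truth labels exist. Swapping `euclidean` for another distance function would allow different similarity metrics to be compared on the same footing.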


Pages: 4525-4531

Page count: 7
