Understanding Data Similarity in Large-Scale Scientific Datasets

Cited by: 0
Authors
Linton, Payton [1]
Melodia, William [1]
Lazar, Alina [1]
Agarwal, Deborah [2]
Bianchi, Ludovico [2]
Ghoshal, Devarshi [2]
Pastorello, Gilbert [2]
Ramakrishnan, Lavanya [2]
Wu, Kesheng [2]
Affiliations
[1] Youngstown State Univ, Youngstown, OH 44555 USA
[2] Lawrence Berkeley Natl Lab, Berkeley, CA USA
Keywords
dimensionality reduction; clustering; similarity measure; TIME; FLUXES;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Today, scientific experiments and simulations produce massive amounts of heterogeneous data that need to be stored and analyzed. Given that these large datasets are stored in many files, formats and locations, how can scientists find relevant data, duplicates or similarities? In this context, we concentrate on developing algorithms to compare the similarity of time series for the purpose of search, classification and clustering. For example, generating accurate patterns from climate-related time series is important not only for building models for weather forecasting and climate prediction, but also for modeling and predicting the cycles of carbon, water, and energy. We developed the methodology and ran an exploratory analysis of climatic and ecosystem variables from the FLUXNET2015 dataset. The proposed combination of similarity metrics, nonlinear dimension reduction, clustering methods and validity measures for time series data has never been applied to unlabeled datasets before, and provides a process that can be easily extended to other scientific time series data. The dimensionality reduction step provides a good way to identify the optimum number of clusters, detect outliers and assign initial labels to the time series data. We evaluated multiple similarity metrics in terms of internal cluster validity for driver as well as response variables. While the best metric often depends on a number of factors, the Euclidean distance seems to perform well for most variables and also in terms of computational expense.
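The process outlined in the abstract (pick a similarity metric, cluster the series, score the result with an internal validity measure) can be illustrated with a minimal Python sketch. This is not the authors' code. The abstract does not name the clustering methods or validity measures used, so k-medoids and the mean silhouette width are stand-ins chosen here as assumptions. The nonlinear dimensionality reduction step is omitted, and the input is synthetic data, not FLUXNET2015. Only the Euclidean distance is taken from the abstract.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(series, metric=euclidean):
    """Symmetric matrix of pairwise distances under the given metric."""
    n = len(series)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = metric(series[i], series[j])
    return d


def assign(d, medoids):
    """Label each series with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]])
            for i in range(len(d))]


def k_medoids(d, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix (assumed method)."""
    medoids = random.Random(seed).sample(range(len(d)), k)
    for _ in range(max_iter):
        labels = assign(d, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda m: sum(d[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(d, medoids)


def silhouette(d, labels):
    """Mean silhouette width, an internal validity index in [-1, 1]."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            continue  # singletons (or a single cluster) contribute 0
        a = sum(d[i][j] for j in own) / len(own)
        b = min(sum(d[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Synthetic stand-in data: four seasonal curves and four linear trends.
T = 24
seasonal = [[math.sin(2 * math.pi * t / T) + 0.05 * i for t in range(T)] for i in range(4)]
trending = [[3.0 + 0.1 * t + 0.05 * i for t in range(T)] for i in range(4)]
series = seasonal + trending

d = distance_matrix(series)
labels = k_medoids(d, k=2)
score = silhouette(d, labels)
print(labels, round(score, 3))
```

Because the clustering and the validity score both work from the precomputed distance matrix, a different similarity metric can be compared by passing another function as `metric` to `distance_matrix` and checking how the silhouette changes.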
Pages: 4525-4531
Number of pages: 7