Understanding Data Similarity in Large-Scale Scientific Datasets

Cited by: 0
Authors
Linton, Payton [1]
Melodia, William [1]
Lazar, Alina [1]
Agarwal, Deborah [2]
Bianchi, Ludovico [2]
Ghoshal, Devarshi [2]
Pastorello, Gilbert [2]
Ramakrishnan, Lavanya [2]
Wu, Kesheng [2]
Affiliations
[1] Youngstown State Univ, Youngstown, OH 44555 USA
[2] Lawrence Berkeley Natl Lab, Berkeley, CA USA
Keywords
dimensionality reduction; clustering; similarity measure; TIME; FLUXES;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Today, scientific experiments and simulations produce massive amounts of heterogeneous data that need to be stored and analyzed. Given that these large datasets are stored in many files, formats and locations, how can scientists find relevant data, duplicates or similarities? In this context, we concentrate on developing algorithms to compare the similarity of time series for the purpose of search, classification and clustering. For example, generating accurate patterns from climate-related time series is important not only for building models for weather forecasting and climate prediction, but also for modeling and predicting the cycles of carbon, water, and energy. We developed the methodology and ran an exploratory analysis of climatic and ecosystem variables from the FLUXNET2015 dataset. The proposed combination of similarity metrics, nonlinear dimension reduction, clustering methods and validity measures for time series data has never been applied to unlabeled datasets before, and provides a process that can be easily extended to other scientific time series data. The dimensionality reduction step provides a good way to identify the optimum number of clusters, detect outliers and assign initial labels to the time series data. We evaluated multiple similarity metrics in terms of internal cluster validity for driver as well as response variables. While the best metric often depends on a number of factors, the Euclidean distance seems to perform well for most variables and also in terms of computational expense.
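As a rough illustration of two ingredients named in the abstract (a similarity metric and an internal cluster validity measure), the sketch below computes pairwise Euclidean distances between time series and scores a labeling with the mean silhouette coefficient. The silhouette coefficient is an assumption chosen for illustration; the abstract does not say which validity measures were used. The synthetic sine series and all function names are hypothetical and are not taken from the paper or from FLUXNET2015.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(series):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = euclidean(series[i], series[j])
    return dist


def silhouette(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix.

    Illustrative internal validity measure (an assumption, not
    necessarily the measure used in the paper).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical synthetic data: two groups of sine curves at different offsets.
t = [i / 10 for i in range(50)]
group_a = [[math.sin(x) + 0.01 * k for x in t] for k in range(3)]
group_b = [[math.sin(x) + 5.0 + 0.01 * k for x in t] for k in range(3)]
series = group_a + group_b

dist = distance_matrix(series)
good = silhouette(dist, [0, 0, 0, 1, 1, 1])  # labels match the two groups
bad = silhouette(dist, [0, 1, 0, 1, 0, 1])   # labels mix the two groups
```

A labeling that matches the two groups should score close to 1, while the mixed labeling scores lower. Comparing such scores across candidate metrics is one plausible way to evaluate them on unlabeled data.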
Pages: 4525 - 4531
Number of pages: 7
Related papers
50 in total
  • [1] LARGE-SCALE DATASETS FOR GOING DEEPER IN IMAGE UNDERSTANDING
    Wu, Jiahong
    Zheng, He
    Zhao, Bo
    Li, Yixin
    Yan, Baoming
    Liang, Rui
    Wang, Wenjia
    Zhou, Shipei
    Lin, Guosen
    Fu, Yanwei
    Wang, Yizhou
    Wang, Yonggang
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1480 - 1485
  • [2] Big Data Analytics on Large-Scale Scientific Datasets in the INDIGO-DataCloud Project
    Fiore, Sandro
    Palazzo, Cosimo
    D'Anca, Alessandro
    Elia, Donatello
    Londero, Elisa
    Knapic, Cristina
    Monna, Stephen
    Marcucci, Nicola M.
    Aguilar, Fernando
    Plociennik, Marcin
    De Lucas, Jesus E. Marco
    Aloisio, Giovanni
    ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2017, 2017, : 343 - 348
  • [3] Pyramid: A General Framework for Distributed Similarity Search on Large-scale Datasets
    Deng, Shiyuan
    Yan, Xiao
    Ng, Kelvin K. W.
    Jiang, Chenyu
    Cheng, James
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 1066 - 1071
  • [4] Modelling Large-Scale Scientific Data Transfers
    Bogado J.
    Lassnig M.
    Monticelli F.
    Díaz J.
    Computing and Software for Big Science, 2022, 6 (1)
  • [5] Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets
    Morrow, Alyssa Kramer
    He, George Zhixuan
    Nothaft, Frank Austin
    Tu, Eric Tongching
    Paschall, Justin
    Yosef, Nir
    Joseph, Anthony Douglas
    CELL SYSTEMS, 2019, 9 (06) : 609 - +
  • [6] Parallel Tensor Compression for Large-Scale Scientific Data
    Austin, Woody
    Ballard, Grey
    Kolda, Tamara G.
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016), 2016, : 912 - 922
  • [7] Mesh data management in large-scale scientific computing
    Chen, Hong
    Zheng, Winmin
    PROCEEDINGS OF THE THIRD CHINAGRID ANNUAL CONFERENCE, 2008, : 144 - 152
  • [8] Parallel visualization of large-scale multifield scientific data
    Cao, Yi
    Mo, Zeyao
    Ai, Zhiwei
    Wang, Huawei
    Xiao, Li
    Zhang, Zhe
    JOURNAL OF VISUALIZATION, 2019, 22 (06) : 1107 - 1123
  • [9] Linking Visualization and Scientific Understanding through Interactive Rendering of Large-Scale Data in Parallel Environment
    Cao, Yi
    Wang, Huawei
    Ai, Zhiwei
    2015 5TH INTERNATIONAL CONFERENCE ON VIRTUAL REALITY AND VISUALIZATION (ICVRV 2015), 2015, : 260 - 263