Understanding Data Similarity in Large-Scale Scientific Datasets

Cited by: 0
Authors
Linton, Payton [1]
Melodia, William [1]
Lazar, Alina [1]
Agarwal, Deborah [2]
Bianchi, Ludovico [2]
Ghoshal, Devarshi [2]
Pastorello, Gilbert [2]
Ramakrishnan, Lavanya [2]
Wu, Kesheng [2]
Affiliations
[1] Youngstown State Univ, Youngstown, OH 44555 USA
[2] Lawrence Berkeley Natl Lab, Berkeley, CA USA
Keywords
dimensionality reduction; clustering; similarity measure; TIME; FLUXES;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Today, scientific experiments and simulations produce massive amounts of heterogeneous data that need to be stored and analyzed. Given that these large datasets are stored in many files, formats and locations, how can scientists find relevant data, duplicates or similarities? In this context, we concentrate on developing algorithms to compare the similarity of time series for the purpose of search, classification and clustering. For example, generating accurate patterns from climate-related time series is important not only for building models for weather forecasting and climate prediction, but also for modeling and predicting the cycle of carbon, water, and energy. We developed the methodology and ran an exploratory analysis of climatic and ecosystem variables from the FLUXNET2015 dataset. The proposed combination of similarity metrics, nonlinear dimension reduction, clustering methods and validity measures for time series data has never been applied to unlabeled datasets before, and provides a process that can be easily extended to other scientific time series data. The dimensionality reduction step provides a good way to identify the optimum number of clusters, detect outliers and assign initial labels to the time series data. We evaluated multiple similarity metrics in terms of the internal cluster validity for driver as well as response variables. While the best metric often depends on a number of factors, the Euclidean distance seems to perform well for most variables and also in terms of computational expense.
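The abstract names a pipeline of similarity metric, clustering and internal validity measure. A minimal sketch of that idea, using only the Python standard library, might look like the following. It is not the authors' implementation: the nonlinear dimension reduction step is omitted, k-medoids is swapped in as a simple clustering method that works on any precomputed distance matrix, the silhouette coefficient stands in for the unspecified validity measures, and the toy series are synthetic rather than FLUXNET2015 data. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(series):
    """Pairwise distance matrix; any other similarity metric could be used here."""
    n = len(series)
    return [[euclidean(series[i], series[j]) for j in range(n)] for i in range(n)]


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (assumed method)."""
    n = len(dist)
    # Deterministic farthest-point initialisation, starting from series 0.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    labels = [0] * n
    for _ in range(max_iter):
        # Assign every series to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Move each medoid to the member with the smallest total in-cluster distance.
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                updated.append(medoids[c])
                continue
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return labels, medoids


def silhouette(dist, labels):
    """Mean silhouette coefficient, one common internal cluster validity index."""
    n = len(dist)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # singleton cluster: score 0 by convention
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for other in (
                [j for j in range(n) if labels[j] == c]
                for c in clusters if c != labels[i]
            )
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / n


# Synthetic toy data: three oscillating series and three ramp-like series.
series = [
    [0, 1, 0, -1, 0, 1, 0, -1],
    [0.1, 1.1, 0.1, -0.9, 0.1, 1.1, 0.1, -0.9],
    [0, 0.9, 0, -1.1, 0, 0.9, 0, -1.1],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0.2, 1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2],
    [0, 1.1, 2.1, 3, 4.1, 5, 6.1, 7],
]
dist = distance_matrix(series)
for k in (2, 3):
    labels, _ = k_medoids(dist, k)
    print(k, labels, round(silhouette(dist, labels), 3))
```

Comparing the silhouette value across candidate values of k, or across different distance functions plugged into `distance_matrix`, is one simple way to rank metrics by internal cluster validity when no labels are available.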
Pages: 4525 - 4531
Number of pages: 7
Related papers
50 in total
  • [21] Towards algorithmic analytics for large-scale datasets
    Bzdok, Danilo
    Nichols, Thomas E.
    Smith, Stephen M.
    NATURE MACHINE INTELLIGENCE, 2019, 1 (07) : 296 - 306
  • [22] RANSAC-SVM for Large-Scale Datasets
    Nishida, Kenji
    Kurita, Takio
    19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008, : 3767 - 3770
  • [23] A simulation study of data distribution strategies for large-scale scientific data collaborations
    Al Kiswany, Samer
    Ripeanu, Matei
    2007 CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1-3, 2007, : 223 - 226
  • [24] Map Matching Algorithm for Large-scale Datasets
    Fiedler, David
    Cap, Michal
    Nykl, Jan
    Zilecky, Pavol
    ICAART: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 3, 2022, : 500 - 508
  • [25] Momentum Online LDA for Large-scale Datasets
    Ouyang, Jihong
    Lu, You
    Li, Ximing
    21ST EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2014), 2014, 263 : 1075 - 1076
  • [26] Error-Controlled Data Reduction Approach for Large-Scale Structured Datasets
    Ai Z.
    Leng J.
    Xia F.
    Wang H.
    Cao Y.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2021, 33 (12): : 1795 - 1802
  • [27] Large-Scale Datasets in Special Education Research
    Griffin, Megan M.
    Steinbrecher, Trisha D.
    USING SECONDARY DATASETS TO UNDERSTAND PERSONS WITH DEVELOPMENTAL DISABILITIES AND THEIR FAMILIES, 2013, 45 : 155 - 183
  • [28] Towards algorithmic analytics for large-scale datasets
    Danilo Bzdok
    Thomas E. Nichols
    Stephen M. Smith
    Nature Machine Intelligence, 2019, 1 : 296 - 306
  • [29] Understanding Desire to Touch Using Large-scale Twitter Data
    Ujitoko Y.
    NTT Technical Review, 2023, 21 (01): : 30 - 33
  • [30] Iterative Classification for Sanitizing Large-Scale Datasets
    Li, Bo
    Vorobeychik, Yevgeniy
    Li, Muqun
    Malin, Bradley
    2015 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2015, : 841 - 846