Understanding Data Similarity in Large-Scale Scientific Datasets

Cited by: 0
Authors
Linton, Payton [1]
Melodia, William [1]
Lazar, Alina [1]
Agarwal, Deborah [2]
Bianchi, Ludovico [2]
Ghoshal, Devarshi [2]
Pastorello, Gilbert [2]
Ramakrishnan, Lavanya [2]
Wu, Kesheng [2]
Affiliations
[1] Youngstown State Univ, Youngstown, OH 44555 USA
[2] Lawrence Berkeley Natl Lab, Berkeley, CA USA
Keywords
dimensionality reduction; clustering; similarity measure; TIME; FLUXES
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Today, scientific experiments and simulations produce massive amounts of heterogeneous data that need to be stored and analyzed. Given that these large datasets are stored in many files, formats and locations, how can scientists find relevant data, duplicates or similarities? In this context, we concentrate on developing algorithms to compare the similarity of time series for the purpose of search, classification and clustering. For example, generating accurate patterns from climate-related time series is important not only for building models for weather forecasting and climate prediction, but also for modeling and predicting the cycles of carbon, water, and energy. We developed the methodology and ran an exploratory analysis of climatic and ecosystem variables from the FLUXNET2015 dataset. The proposed combination of similarity metrics, nonlinear dimension reduction, clustering methods and validity measures for time series data has never been applied to unlabeled datasets before, and provides a process that can be easily extended to other scientific time series data. The dimensionality reduction step provides a good way to identify the optimum number of clusters, detect outliers and assign initial labels to the time series data. We evaluated multiple similarity metrics in terms of internal cluster validity for driver as well as response variables. While the best metric often depends on a number of factors, the Euclidean distance seems to perform well for most variables and also in terms of computational expense.
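As an illustration of comparing similarity metrics by internal cluster validity, the following is a minimal sketch and not the authors' implementation. The abstract does not say which validity measure or which alternative metrics were used, so the mean silhouette coefficient, the Manhattan distance, the toy series and the function names below are all assumptions chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Series = Sequence[float]
Metric = Callable[[Series, Series], float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Series, b: Series) -> float:
    """Manhattan (L1) distance, used here only as a second metric to compare."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_silhouette(series: Sequence[Series], labels: Sequence[int],
                    metric: Metric) -> float:
    """Mean silhouette coefficient: an internal validity score in [-1, 1].

    Assumed validity measure; higher means tighter, better separated clusters.
    """
    n = len(series)
    scores: List[float] = []
    for i in range(n):
        # Group distances from series i to every other series by cluster label.
        by_cluster: Dict[int, List[float]] = {}
        for j in range(n):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(
                    metric(series[i], series[j]))
        own = by_cluster.get(labels[i])
        others = [sum(d) / len(d)
                  for lab, d in by_cluster.items() if lab != labels[i]]
        if not own or not others:
            # Singleton cluster or a single cluster overall: score is 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)   # mean distance within the own cluster
        b = min(others)           # mean distance to the nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


# Hypothetical toy data: three rising and three falling series.
series = [[0, 1, 2, 3], [0, 1, 2, 4], [1, 1, 2, 3],
          [10, 9, 8, 7], [10, 9, 8, 6], [9, 9, 8, 7]]
labels = [0, 0, 0, 1, 1, 1]

for name, metric in [("euclidean", euclidean), ("manhattan", manhattan)]:
    print(name, round(mean_silhouette(series, labels, metric), 3))
```

In a fuller pipeline the labels would come from a clustering step applied after dimensionality reduction; here they are fixed by hand so that only the metric varies.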
Pages: 4525 - 4531
Page count: 7
Related papers
50 in total
  • [41] Similarity caching in large-scale image retrieval
    Falchi, Fabrizio
    Lucchese, Claudio
    Orlando, Salvatore
    Perego, Raffaele
    Rabitti, Fausto
    INFORMATION PROCESSING & MANAGEMENT, 2012, 48 (05) : 803 - 818
  • [42] Large-Scale Similarity Search with Optimal Transport
    Laouar, Clea
    Takezawa, Yuki
    Yamada, Makoto
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 11920 - 11930
  • [43] Effective large-scale sequence similarity searches
    Claverie, JM
    COMPUTER METHODS FOR MACROMOLECULAR SEQUENCE ANALYSIS, 1996, 266 : 212 - 227
  • [44] Large-scale supervised similarity learning in networks
    Shiyu Chang
    Guo-Jun Qi
    Yingzhen Yang
    Charu C. Aggarwal
    Jiayu Zhou
    Meng Wang
    Thomas S. Huang
    Knowledge and Information Systems, 2016, 48 : 707 - 740
  • [45] Large-scale supervised similarity learning in networks
    Chang, Shiyu
    Qi, Guo-Jun
    Yang, Yingzhen
    Aggarwal, Charu C.
    Zhou, Jiayu
    Wang, Meng
    Huang, Thomas S.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2016, 48 (03) : 707 - 740
  • [46] Large-Scale Text Similarity Computing with Spark
    Bao, Xiaoan
    Dai, Shichao
    Zhang, Na
    Yu, Chenghai
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2016, 9 (04) : 95 - 100
  • [47] Deploying Large-Scale Datasets on-Demand in the Cloud: Treats and Tricks on Data Distribution
    Vaquero, Luis M.
    Celorio, Antonio
    Cuadrado, Felix
    Cuevas, Ruben
    IEEE TRANSACTIONS ON CLOUD COMPUTING, 2015, 3 (02) : 132 - 144
  • [48] Understanding Traffic Density from Large-Scale Web Camera Data
    Zhang, Shanghang
    Wu, Guanhang
    Costeira, Joao P.
    Moura, Jose M. F.
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4264 - 4273
  • [49] Editorial Note: Large-scale Heterogeneous Multimedia Data Computing and Understanding
    Gao, Zan
    Zhang, Hanwang
    Wang, Charles
    Yang, Yi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (17) : 22033 - 22033
  • [50] Understanding Adherence and Prescription Patterns Using Large-Scale Claims Data
    Margrét V. Bjarnadóttir
    Sana Malik
    Eberechukwu Onukwugha
    Tanisha Gooden
    Catherine Plaisant
    PharmacoEconomics, 2016, 34 : 169 - 179