Kinematics of Big Biomedical Data to characterize temporal variability and seasonality of data repositories: Functional Data Analysis of data temporal evolution over non-parametric statistical manifolds

Cited by: 14
Authors
Saez, Carlos [1 ]
Garcia-Gomez, Juan M. [1 ]
Affiliations
[1] Univ Politecn Valencia, Biomed Data Sci Lab BDSLab, Inst Univ Tecnol Informac & Comunicac ITACA, Camino de Vera S-N, E-46022 Valencia, Spain
Funding
EU Horizon 2020;
Keywords
Temporal stability; Data quality; Time series; Data reuse; Big data; Seasonality; Coordinate-free; Trajectories; Functional data analysis; Statistical manifolds; DATA QUALITY ASSESSMENT;
DOI
10.1016/j.ijmedinf.2018.09.015
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Aim: The increasing availability of Big Biomedical Data is leading to large research data samples collected over long periods of time. We propose analysing the kinematics of data probability distributions over time in order to characterize the temporal variability of data.
Methods: First, we propose a kinematic model based on the estimation of a continuous data temporal trajectory, using Functional Data Analysis over the embedding of a non-parametric statistical manifold whose points represent temporal batches of data, the Information Geometric Temporal (IGT) plot. This model allows the velocity and acceleration of data changes to be measured. Next, we propose a coordinate-free method to characterize the oriented seasonality of data based on the parallelism of lagged velocity vectors of the data trajectory throughout the IGT space: the Auto-Parallelism of Velocity Vectors (APVV) and the APVVmap. Finally, we automatically explain the maximum-variance components of the IGT space coordinates by correlating data points with known temporal factors from the application domain.
Materials: The methods are evaluated on the US National Hospital Discharge Survey (NHDS) open dataset, consisting of 3.25M hospital discharges between 2000 and 2010.
Results: Seasonal and abrupt behaviours were present in the estimated multivariate and univariate data trajectories. The kinematic analysis revealed seasonal effects and isolated increases in data speed, the latter mainly related to abrupt changes in coding. The APVV and APVVmap revealed oriented seasonal changes in the data trajectories. For most variables, the distributions tended to change in the same direction with a 12-month period, with a peak in the change of direction at mid-year and at the end of the year. Diagnosis and Procedure codes also included a 9-month periodic component. The kinematics and APVV methods were able to detect seasonal effects in data with extreme temporal subgrouping, such as Procedure code, where Fourier and autocorrelation methods were not. The automated explanation of the IGT space coordinates was consistent with the results of the kinematic and seasonal analyses. Coordinates received different meanings according to the trajectory trend, seasonality and abrupt changes.
Discussion: Treating data as a particle moving over time through a multidimensional probabilistic space and studying the kinematics of its trajectory has turned out to be a new methodology for temporal variability analysis. Its results on the NHDS were aligned with the dataset and population descriptions found in the literature, and contribute a novel characterization of temporal variability. We have demonstrated that the APVV and APVVmap are appropriate tools for the coordinate-free and oriented analysis of trajectories or complex multivariate signals.
Conclusion: The proposed methods comprise an exploratory methodology for the characterization of data temporal variability, which may be useful for the reliable reuse of Big Biomedical Data repositories acquired over long periods of time.
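The kinematic and seasonality ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the trajectory is already available as a list of embedded points (one per temporal batch), replaces the Functional Data Analysis smoothing with plain finite differences, and takes the mean cosine similarity between velocity vectors a fixed lag apart as one plausible reading of "parallelism of lagged velocity vectors". All function names and the toy trajectory are hypothetical.

```python
import math


def finite_differences(points, dt=1.0):
    """Forward differences between consecutive points (assumed stand-in for FDA derivatives)."""
    return [
        [(b - a) / dt for a, b in zip(p, q)]
        for p, q in zip(points, points[1:])
    ]


def norm(v):
    """Euclidean length of a vector; applied to a velocity vector it gives the speed."""
    return math.sqrt(sum(c * c for c in v))


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is null."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def auto_parallelism(vels, lag):
    """Mean cosine similarity between velocity vectors `lag` steps apart (assumed definition)."""
    pairs = list(zip(vels, vels[lag:]))
    if not pairs:
        raise ValueError("lag is too large for the number of velocity vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy trajectory: a 12-step cycle plus a slow drift along the first axis,
# standing in for monthly batches embedded in a two-dimensional IGT-like space.
trajectory = [
    (math.cos(2 * math.pi * t / 12) + 0.01 * t, math.sin(2 * math.pi * t / 12))
    for t in range(48)
]

vels = finite_differences(trajectory)      # velocity vectors
accs = finite_differences(vels)            # acceleration vectors
speeds = [norm(v) for v in vels]           # scalar speed per step
apvv = {lag: auto_parallelism(vels, lag) for lag in range(1, 25)}

# Lags whose velocity vectors are most parallel hint at a seasonal period.
best_lag = max(apvv, key=apvv.get)
```

Because the measure depends only on angles between velocity vectors, it does not change if the embedding is rotated or reflected, which is the sense in which such an analysis is coordinate-free. On the toy trajectory the parallelism is highest at lags that are multiples of 12 and strongly negative at a lag of 6, mirroring the oriented 12-month seasonality described in the Results.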
Pages: 109-124
Number of pages: 16