Synthetic Data by Principal Component Analysis

Cited by: 3
Author
Sano, Natsuki [1]
Affiliation
[1] Tokyo Univ Informat Sci, Dept Informat, Chiba, Japan
Source
20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2020) | 2020
Keywords
Statistical Disclosure Control; Synthetic Data; Principal Component Analysis; Sandglass-type Neural Networks;
DOI
10.1109/ICDMW51313.2020.00023
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In statistical disclosure control, releasing synthetic data makes it difficult to identify individual records, because the values in the synthetic data differ from those in the original data. We propose two methods of generating synthetic data using principal component analysis: orthogonal transformation (linear method) and sandglass-type neural networks (nonlinear method). While the typical method of generating synthetic data by multiple imputation requires common variables between the population and survey data, our proposed methods can generate synthetic data without common variables. Additionally, the linear method can explicitly evaluate information loss as the ratio of discarded eigenvalues. We generate synthetic data with the proposed methods for decathlon data and evaluate four information loss measures: our proposed information loss measure, the mean absolute error for each record, the mean absolute error of the mean of each variable, and the mean absolute error of the covariance between variables. We find that information loss in the linear method is less than that in the nonlinear method.
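The linear method described in the abstract can be illustrated with a minimal sketch. It rests on one assumed reading of the abstract, not on the paper's published procedure: records are projected onto the top k principal components and mapped back, and information loss is the sum of the discarded eigenvalues divided by the sum of all eigenvalues. The function name `pca_synthetic`, the parameter `k`, and the random toy data are hypothetical, and the decathlon data is not reproduced here.

```python
import numpy as np

def pca_synthetic(X, k):
    """Project X onto its top-k principal components and map back.

    Assumed reading of the linear method: the back-projected records
    serve as synthetic data, and information loss is the share of
    total variance carried by the discarded eigenvalues.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                # center each variable
    cov = np.cov(Xc, rowvar=False)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                           # orthonormal basis of kept components
    synthetic = Xc @ W @ W.T + mean              # project, back-project, restore mean
    loss = eigvals[k:].sum() / eigvals.sum()     # ratio of discarded eigenvalues
    return synthetic, loss

# Toy data (hypothetical): 200 records, 5 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
synthetic, loss = pca_synthetic(X, k=3)
print(f"information loss with k=3: {loss:.3f}")
```

Under this reading, the loss ratio equals the squared reconstruction error divided by the total squared deviation from the mean, so it can be computed from the eigenvalues alone without comparing records one by one.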
Pages: 101-105
Page count: 5