Training Data Augmentation with Data Distilled by Principal Component Analysis

Authors
Sirakov, Nikolay Metodiev [1]
Shahnewaz, Tahsin [1]
Nakhmani, Arie [2]
Affiliations
[1] Texas A&M Univ Commerce, Dept Math, Commerce, TX 75429 USA
[2] Univ Alabama Birmingham, Dept Elect & Comp Engn, Birmingham, AL 35294 USA
Funding
National Institutes of Health (NIH), USA
Keywords
data; distillation; augmentation; classification; machine learning
DOI
10.3390/electronics13020282
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This work develops a new method for augmenting vector data. The method applies principal component analysis (PCA) to determine the eigenvectors of a set of training vectors for a machine learning (ML) method and uses these eigenvectors to generate distilled vectors. The training vectors and the PCA-distilled vectors have the same dimension. The user chooses the number of vectors to distill and add to the set of training vectors. A statistical approach determines the smallest number of distilled vectors such that, when they are added to the original vectors, the extended set trains an ML classifier to a required accuracy. The novelty of this study is therefore the distillation of vectors with PCA and their use to augment the original set of vectors. The advantage of this approach is that it improves the classification statistics of ML classifiers. To validate this advantage, we conducted experiments with four public databases and applied four classifiers: a neural network, logistic regression, and support vector machines with linear and polynomial kernels. For augmentation, we conducted several distillations, including nested (double) distillation, in which new vectors are distilled from already distilled vectors. We trained the classifiers with three sets of vectors: the original vectors; the original vectors augmented with PCA-distilled vectors; and the original vectors augmented with both PCA-distilled and double PCA-distilled vectors. The experimental results are presented in the paper and confirm that PCA-distilled vectors improve the classification statistics of ML methods when they augment the original training vectors.
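The abstract does not state how the eigenvectors are turned into distilled vectors. The following Python sketch shows one plausible reading, under an assumption that is not taken from the paper: each distilled vector is a training vector projected onto the leading principal directions and mapped back to the original space, so it keeps the same dimension. The function name `pca_distill`, its parameters, and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np


def pca_distill(X, n_components, n_distilled, seed=0):
    """Return n_distilled vectors with the same dimension as the rows of X.

    Assumption (not from the paper): a distilled vector is a training
    vector projected onto the top principal directions and mapped back.
    Also returns the indices of the source vectors so labels can be reused.
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    centered = X - mean
    # Eigenvectors of the sample covariance matrix of the training vectors.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top = eigvecs[:, order]
    # Choose which training vectors to distill (without replacement).
    idx = rng.choice(len(X), size=n_distilled, replace=False)
    scores = centered[idx] @ top
    return scores @ top.T + mean, idx


# Synthetic stand-in for a training set: 40 vectors of dimension 5.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 5))
y_train = rng.integers(0, 2, size=40)

# Single distillation, then augmentation of the original set.
X_dist, idx = pca_distill(X_train, n_components=3, n_distilled=10)
X_aug = np.vstack([X_train, X_dist])
y_aug = np.concatenate([y_train, y_train[idx]])

# Nested (double) distillation: distill again from the distilled vectors.
X_double, idx2 = pca_distill(X_dist, n_components=2, n_distilled=5)
X_aug2 = np.vstack([X_aug, X_double])
y_aug2 = np.concatenate([y_aug, y_train[idx][idx2]])
```

In this sketch `X_aug` and `X_aug2` correspond to the second and third training sets described in the abstract, and either could be passed to any classifier in place of `X_train`. Reusing the label of the source vector for each distilled vector is also an assumption made here for simplicity.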
Pages: 17