Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Cited by: 0

Authors:

Gaye, Aboubacry [1,2]

Diongue, Abdou Ka [1]

Sylla, Seydou Nourou [3]

Diarra, Maryam [2]

Diallo, Amadou [2]

Talla, Cheikh [2]

Loucoubar, Cheikh [2]

Affiliations:

[1] Gaston Berger Univ St Louis, Lab Studies & Res Stat & Dev, St Louis, Senegal

[2] Inst Pasteur, Clin Res & Data Sci Unit, Epidemiol, Dakar 220, Senegal

[3] Alioune Diop Univ Bambey, Informat & Commun Technol Dev, Bambey, Senegal

Source:

Journal of Classification | 2024, Vol. 41, No. 1

Keywords:

Supervised dimension reduction; Correlation blocks; High-dimensional supervised classification; Genomic data; HAPLOTYPE BLOCKS; ASSOCIATION; LINKAGE;

DOI:

10.1007/s00357-024-09463-5

Chinese Library Classification (CLC) number:

O1 [Mathematics];

Discipline classification code:

0701 ; 070101 ;

Abstract:

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning allows us to handle the high correlation in our data by grouping variables into blocks such that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performance. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in the study of complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach has significantly high accuracy in predicting malaria episodes.
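
The general pipeline the abstract describes (partition correlated variables into blocks, reduce each block to a low-dimensional score, then classify) can be illustrated with a minimal sketch. This is not the authors' method: the block step below is a simple adjacent-correlation threshold standing in for interval graph modeling, the reduction is ordinary per-block PCA rather than the supervised PCA extension, and the data are synthetic. All function names (`make_toy_data`, `partition_blocks`, `block_scores`, `fit_lda`, `predict_lda`) and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_toy_data(n=200, seed=0):
    """Synthetic stand-in data: 3 blocks of 4 highly correlated columns.
    Only the first block carries class signal (an assumption for the demo)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)
    latents = rng.normal(size=(n, 3))
    latents[:, 0] += 2.0 * y                      # class shift in block 0
    X = np.repeat(latents, 4, axis=1) + 0.1 * rng.normal(size=(n, 12))
    return X, y

def partition_blocks(X, threshold=0.5):
    """Greedy contiguous partition: open a new block whenever the absolute
    correlation between adjacent columns falls below `threshold`."""
    corr = np.corrcoef(X, rowvar=False)
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        if abs(corr[j - 1, j]) >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

def block_scores(X, blocks):
    """Summarize each block by its first principal component score."""
    scores = []
    for cols in blocks:
        Z = X[:, cols] - X[:, cols].mean(axis=0)
        _, _, vt = np.linalg.svd(Z, full_matrices=False)
        scores.append(Z @ vt[0])
    return np.column_stack(scores)

def fit_lda(S, y, ridge=1e-6):
    """Two-class LDA with pooled covariance and equal priors."""
    S0, S1 = S[y == 0], S[y == 1]
    mu0, mu1 = S0.mean(axis=0), S1.mean(axis=0)
    c0 = np.atleast_2d(np.cov(S0, rowvar=False))
    c1 = np.atleast_2d(np.cov(S1, rowvar=False))
    pooled = ((len(S0) - 1) * c0 + (len(S1) - 1) * c1) / (len(S) - 2)
    pooled = pooled + ridge * np.eye(S.shape[1])  # small ridge for stability
    w = np.linalg.solve(pooled, mu1 - mu0)
    b = -w @ (mu0 + mu1) / 2.0
    return w, b

def predict_lda(S, w, b):
    """Assign class 1 when the linear discriminant is positive."""
    return (S @ w + b > 0).astype(int)

if __name__ == "__main__":
    X, y = make_toy_data()
    blocks = partition_blocks(X)
    S = block_scores(X, blocks)
    w, b = fit_lda(S, y)
    print("blocks:", blocks)
    print("training accuracy:", (predict_lda(S, w, b) == y).mean())
```

The sketch evaluates on its own training data for brevity; a real analysis would use held-out samples or cross-validation, and real SNP data would need a genotype-appropriate correlation (LD) measure.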

Pages: 158-169

Page count: 12
