Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution

Cited by: 3

Authors:

Galligan, Marie C. [1,2]

Saldova, Radka [2]

Campbell, Matthew P. [3]

Rudd, Pauline M. [2]

Murphy, Thomas B. [1]

Affiliations:

[1] Natl Univ Ireland Univ Coll Dublin, Sch Math Sci, Dublin 4, Ireland

[2] NIBRT, NIBRT Dublin Oxford Glycobiol Lab, Dublin 4, Ireland

[3] Macquarie Univ, Biomol Frontiers Res Ctr, Dept Chem & Biomol Sci, Sydney, NSW 2109, Australia

Source:

BMC Bioinformatics | 2013 / Volume 14

Keywords:

Compositional data; Beta distribution; Generalized Dirichlet distribution; Variable selection; Feature selection; Correlation-based feature selection; Recursive partitioning; Glycobiology; Glycan; HILIC; Chromatography data; SERUM N-GLYCANS; VARIABLE SELECTION; PROSTATE-CANCER; PERFORMANCE; REGRESSION; GLYCOMICS; TOOLS;

DOI:

10.1186/1471-2105-14-155

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Background: Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers. High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties. As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers.

Results: A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of "grouping variables" that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past.

Conclusions: The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method.
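
Illustrative note (not part of the original record): the abstract refers to a constant-sum constraint and to modelling compositional variables with beta distributions. The sketch below is not the authors' implementation. It only shows, under stated assumptions, how raw peak areas can be closed to a composition and then mapped to the successive fractions z_j = p_j / (1 - p_1 - ... - p_{j-1}), which are the quantities treated as independent beta variables in the standard construction of the generalized Dirichlet distribution. The function names and the peak-area values are hypothetical.

```python
def close(parts):
    """Rescale non-negative parts so they sum to 1 (constant-sum constraint)."""
    total = sum(parts)
    if total <= 0:
        raise ValueError("parts must have a positive sum")
    return [p / total for p in parts]


def stick_breaking_fractions(composition):
    """Map a composition (p_1, ..., p_k) to k-1 successive fractions.

    z_j = p_j / (1 - sum of p_i for i < j). Assumes every part is
    strictly positive, so the remaining mass never reaches zero early.
    """
    fractions = []
    remaining = 1.0
    for p in composition[:-1]:
        fractions.append(p / remaining)
        remaining -= p
    return fractions


# Hypothetical peak areas for a four-peak chromatogram (illustration only).
peak_areas = [20.0, 30.0, 10.0, 40.0]
composition = close(peak_areas)
fractions = stick_breaking_fractions(composition)
print(composition)
print(fractions)
```

Each fraction lies in (0, 1), so a beta density can be fitted to it per group. How the paper scores and compares candidate feature sets during the greedy search is not stated in this record and is not reproduced here.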

References

Pages: 25

53 entries in total

[31] Hijazi R., 2009, J. Applied Probability Statist., V4, P77
[32] Jemal A, 2011, CA-CANCER J CLIN, V61, P134, DOI [10.3322/caac.20115, 10.3322/caac.20107, 10.3322/caac.21492]
[33] Kuster, B; Wheeler, SF; Hunter, AP; Dwek, RA; Harvey, DJ. Sequencing of N-linked oligosaccharides directly from protein gels: In-gel deglycosylation followed by matrix-assisted laser desorption/ionization mass spectrometry and normal-phase high-performance liquid chromatography [J]. ANALYTICAL BIOCHEMISTRY, 1997, 250 (01): 82-101
[34] Lindsay BG., 1995, MIXTURE MODELS THEOR, V5, DOI 10.1214/CBMS/1462106013
[35] LINDSTROM MJ, 1988, J AM STAT ASSOC, V83, P1014
[36] McNicholas, P. D.; Murphy, T. B.; McDaid, A. F.; Frost, D. Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2010, 54 (03): 711-723
[37] Minka T., 2000, Estimating a Dirichlet distribution
[38] Murphy, Thomas Brendan; Dean, Nema; Raftery, Adrian E. Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications [J]. ANNALS OF APPLIED STATISTICS, 2010, 4 (01): 396-421
[39] Packer, Nicolle H.; von der Lieth, Claus-Wilhelm; Aoki-Kinoshita, Kiyoko F.; Lebrilla, Carlito B.; Paulson, James C.; Raman, Rahul; Rudd, Pauline; Sasisekharan, Ram; Taniguchi, Naoyuki; York, William S. Frontiers in glycomics: Bioinformatics and biomarkers in disease - An NIH White Paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006) [J]. PROTEOMICS, 2008, 8 (01): 8-20
[40] Pearson Karl., 1897, P ROY SOC LONDON, V60, P489, DOI 10.1098/rspl.1896.0076
