Discovery of significant porcine SNPs for swine breed identification by a hybrid of information gain, genetic algorithm, and frequency feature selection technique

Cited by: 12
Authors
Pasupa, Kitsuchart [1]
Rathasamuth, Wanthanee [1]
Tongsima, Sissades [2]
Affiliations
[1] King Mongkuts Inst Technol Ladkrabang, Fac Informat Technol, Bangkok 10520, Thailand
[2] Natl Sci & Technol Dev Agcy, Natl Biobank Thailand, Pathum Thani 12120, Khong Luang, Thailand
Keywords
Single nucleotide polymorphisms; Feature selection; Information gain; Genetic algorithm; Support vector machine; Discriminant analysis; Microarray data; Classification
DOI
10.1186/s12859-020-3471-4
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The number of porcine single nucleotide polymorphisms (SNPs) used in genetic association studies is very large, which suits statistical testing. Breed classification, however, needs a much smaller set of porcine-classifying SNPs (PCSNPs) that can still accurately classify pigs into different breeds. This study attempted to find such a PCSNP set. We experimented with combinations of feature selection methods (information gain, conventional and modified genetic algorithms, and our own frequency feature selection method), each paired with a common classifier, the Support Vector Machine, to evaluate performance. Experiments were conducted on a comprehensive data set containing SNPs from native pigs from America, Europe, Africa, and Asia, including Chinese breeds, Vietnamese breeds, and hybrid breeds from Thailand.
Results: The best combination of feature selection methods, a hybrid of information gain, modified genetic algorithm, and frequency feature selection, reduced the number of candidate PCSNPs to only 1.62% (164 PCSNPs) of the total number of SNPs (10,210 SNPs) while maintaining a high classification accuracy (95.12%). This PCSNP set performed nearly identically to larger SNP sets and even to the entire data set. Moreover, most PCSNPs matched well to a set of 94 genes in the PANTHER pathway, conforming to a suggestion by the Porcine Genomic Sequencing Initiative.
Conclusions: The best hybrid method yielded a sufficiently small set of porcine SNPs that accurately classified swine breeds.
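As a rough illustration of the first stage named in the abstract, the sketch below ranks SNPs by information gain with respect to breed labels. It is a minimal example under stated assumptions, not the authors' implementation: the genotype coding (0/1/2), the function names, and the toy data are all hypothetical, and the genetic algorithm, frequency feature selection, and Support Vector Machine stages are not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(genotypes, breeds):
    """Breed entropy minus the entropy left after splitting on one SNP."""
    n = len(breeds)
    groups = {}
    for genotype, breed in zip(genotypes, breeds):
        groups.setdefault(genotype, []).append(breed)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(breeds) - remainder


def rank_snps(snp_columns, breeds, top_k):
    """Return the names of the top_k SNPs with the highest information gain."""
    scores = {name: information_gain(col, breeds)
              for name, col in snp_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: four pigs, two breeds, genotypes coded 0/1/2.
breeds = ["A", "A", "B", "B"]
snps = {
    "snp_informative": [0, 0, 2, 2],  # separates the two breeds perfectly
    "snp_noisy": [0, 1, 0, 1],        # carries no breed information
}
top = rank_snps(snps, breeds, top_k=1)
```

In a pipeline of the kind the abstract describes, a filter like this would only shrink the candidate pool; the remaining SNPs would then go to a wrapper search and a classifier for evaluation.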
Pages: 28