A Two-Step Feature Selection Method to Predict Cancerlectins by Multiview Features and Synthetic Minority Oversampling Technique

Cited by: 25
Authors
Yang, Runtao [1]
Zhang, Chengjin [1, 2]
Zhang, Lina [1]
Gao, Rui [2]
Affiliations
[1] Shandong Univ Weihai, Sch Mech Elect & Informat Engn, Weihai 264209, Peoples R China
[2] Shandong Univ, Sch Control Sci & Engn, Jinan 250061, Shandong, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
PSI-BLAST; SEQUENCE; LECTINS; SITES; PROTEINS; DATABASE; ATTRIBUTES; APOPTOSIS; DNA;
DOI
10.1155/2018/9364182
Chinese Library Classification
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Cancerlectins inhibit the growth of cancer cells and are currently employed as therapeutic agents. Accurate identification of cancerlectins should provide insight into the molecular mechanisms of cancers. In this study, a new computational method based on the RF (Random Forest) algorithm is proposed to further improve the performance of cancerlectin identification. Before feature selection, a hybrid feature space is built by combining four individual feature spaces: CTD (Composition, Transition, and Distribution), PseAAC (Pseudo Amino Acid Composition), PSSM (Position-Specific Scoring Matrix), and disorder. SMOTE (Synthetic Minority Oversampling Technique) is applied to address the imbalanced data problem. To reduce feature redundancy and computational complexity, we propose a two-step feature selection process to select informative features. Five-fold cross-validation is used to evaluate the various prediction strategies. The proposed method achieves a sensitivity of 0.779, a specificity of 0.717, an accuracy of 0.748, and an MCC (Matthews Correlation Coefficient) of 0.497. The prediction results are also compared with those of existing methods on the same dataset using 5-fold cross-validation. The comparison demonstrates the effectiveness of our method for predicting cancerlectins.
Pages: 10