MLSeq: Machine learning interface for RNA-sequencing data

Cited by: 31
Authors
Goksuluk, Dincer [1, 5]
Zararsiz, Gokmen [2, 5]
Korkmaz, Selcuk [3, 5]
Eldem, Vahap [4]
Zararsiz, Gozde Erturk [2]
Ozcetin, Erdener [6]
Ozturk, Ahmet [2, 5]
Karaagaoglu, Ahmet Ergun [1]
Affiliations
[1] Hacettepe Univ, Sch Med, Dept Biostat, TR-06100 Ankara, Turkey
[2] Erciyes Univ, Sch Med, Dept Biostat, TR-38030 Kayseri, Turkey
[3] Trakya Univ, Sch Med, Dept Biostat, TR-22030 Edirne, Turkey
[4] Istanbul Univ, Dept Biol, Fac Sci, TR-34452 Istanbul, Turkey
[5] Turcosa Analyt Solut Ltd Co, Erciyes Teknopk 5, TR-38030 Kayseri, Turkey
[6] Hitit Univ, Fac Engn, Dept Ind Engn, TR-19030 Corum, Turkey
Keywords
RNA-Sequencing; Classification; Negative Binomial; Poisson; Linear discriminant analysis; SHRUNKEN CENTROIDS; SEQ; CLASSIFICATION; REVEALS; PACKAGE;
DOI
10.1016/j.cmpb.2019.04.007
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Background and Objective: Over the last decade, RNA-sequencing has become the method of choice, preferred over microarray technology, for gene-expression-based classification and differential expression analysis because it produces less noisy data. Although many algorithms have been proposed for microarray data, the number of algorithms and programs available for classifying RNA-sequencing data is limited. For this reason, we developed MLSeq to bring together both frequently used classification algorithms and novel approaches and make them available for the classification of RNA-sequencing data. The package is developed in the R language environment and distributed through the Bioconductor network.
Methods: Classification of RNA-sequencing data is not straightforward, since raw data must be preprocessed before downstream analysis. With the MLSeq package, researchers can easily preprocess (normalization, filtering, transformation, etc.) and classify raw RNA-sequencing data using two strategies: (i) applying algorithms proposed directly for the RNA-sequencing data structure, or (ii) transforming RNA-sequencing data to bring it distributionally closer to the microarray data structure and then applying algorithms developed for microarray data. Moreover, we proposed novel algorithms through MLSeq, such as voom-based (voom is an acronym for variance modelling at the observational level) nearest shrunken centroids (voomNSC) and diagonal linear discriminant analysis (voomDLDA).
Materials: Three real RNA-sequencing datasets (i.e., cervical cancer, lung cancer and aging datasets) were used to evaluate model performance. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as algorithms based on discrete distributions, and voomNSC, nearest shrunken centroids (NSC) and support vector machines (SVM) were selected as algorithms based on continuous distributions for model comparisons. The algorithms were compared using classification accuracy and sparsity on an independent test set.
Results: The algorithms based on discrete distributions performed better on the cervical cancer and aging data, with accuracies above 0.92. On the lung cancer data, most of the algorithms performed similarly, with accuracies of 0.88, except that SVM achieved an accuracy of 0.94. Our voomNSC algorithm was the sparsest, selecting 2.2% and 6.6% of all features for the cervical cancer and lung cancer datasets, respectively. However, on the aging data, the sparse classifiers were not able to select an optimal subset of the features.
Conclusion: MLSeq is a comprehensive and easy-to-use interface for the classification of gene expression data. It allows researchers to perform both preprocessing and classification tasks through a single platform. With this property, MLSeq can be considered a pipeline for the classification of RNA-sequencing data. © 2019 Elsevier B.V. All rights reserved.
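Strategy (ii) in the abstract, transforming counts toward a microarray-like scale and then applying a classifier built for continuous data, can be illustrated with a minimal Python sketch. This is not the MLSeq API (MLSeq is an R package) and not the authors' implementation: it applies only the log2 counts-per-million step commonly used by voom, omits voom's precision weights and the shrinkage step of NSC, and classifies with a plain nearest-centroid rule. The function names (`log_cpm`, `fit_centroids`, `predict`) and the toy counts are invented for illustration.

```python
import math

def log_cpm(counts):
    """voom-style log2 counts-per-million for one sample (list of raw counts).
    The 0.5 and 1.0 offsets avoid taking the log of zero."""
    lib_size = sum(counts)
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

def fit_centroids(samples, labels):
    """Mean transformed expression per class (no shrinkage, unlike NSC)."""
    sums, sizes = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            sizes[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        sizes[y] += 1
    return {y: [s / sizes[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def dist(y):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[y]))
    return min(centroids, key=dist)

# Toy data, invented for illustration: rows are samples, columns are genes.
train_counts = [[100, 5, 30], [120, 8, 25], [10, 90, 28], [12, 110, 35]]
train_labels = ["tumor", "tumor", "normal", "normal"]
test_counts = [[95, 10, 20], [8, 70, 30]]
test_labels = ["tumor", "normal"]

# Preprocess (transform), fit on the training set, evaluate on the test set.
train = [log_cpm(c) for c in train_counts]
centroids = fit_centroids(train, train_labels)
predictions = [predict(centroids, log_cpm(c)) for c in test_counts]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Accuracy on a held-out test set, as computed in the last lines, is the same kind of measure the abstract reports; sparsity would additionally require a feature-selecting classifier, which this sketch does not include.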
Pages: 223-231
Number of pages: 9