Latent Dirichlet Allocation for Classification using Gene Expression Data

Cited by: 0
Authors
Yalamanchili, Hima Bindu [1]
Kho, Soon Jye [1]
Raymer, Michael L. [1]
Affiliation
[1] Wright State Univ, Knoesis Ctr, Dayton, OH 45435 USA
Source
2017 IEEE 17TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE) | 2017
Keywords
Topic modeling; Latent Dirichlet Allocation; Classification; Machine learning; Cancer; Gene expression; Challenges
DOI
10.1109/BIBE.2017.00014
Chinese Library Classification (CLC) number
R318 [Biomedical Engineering]
Subject classification code
0831
Abstract
Understanding the role of differential gene expression in the development of, and molecular response to, cancer is a complex problem that remains challenging, in part due to the sheer number of genes, gene products, and metabolites involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to explore patterns of gene expression in healthy and cancer tissues. An important advantage of LDA over alternative statistical and machine learning methods is its proven ability to handle sparse inputs over an extremely large number of features in an unsupervised manner. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. In this paper, we seek to optimize the protocol and parameters for an efficient implementation of LDA. Messenger RNA (mRNA) sequence data from breast cancer and healthy tissue are used to determine an effective approach for applying LDA to the classification of cancer versus healthy tissue. We describe our study in two phases. First, parameters such as the number of topics, bins, and passes were optimized for LDA. Next, we developed a novel LDA-based classification approach that classifies unknown samples based on the similarity of their co-expression patterns. Our evaluation shows that LDA can achieve high accuracy compared to alternative approaches. Overall, our results suggest that LDA is a promising approach for classifying tissue types from gene expression data in cancer studies.
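The abstract names the ingredients of the method (binned expression values, an LDA topic model, and classification by similarity) but not their details. The sketch below is only an illustration of how the two steps around the topic model might look, under loudly stated assumptions: the `<gene>_bin<k>` token encoding, the bin edges, the Hellinger distance as the similarity measure, the per-class mean topic profiles, and all function and gene names are hypothetical and are not taken from the paper. Fitting the topic model itself (for example with a library such as Gensim, choosing the number of topics and passes) is deliberately left out.

```python
import math


def expression_to_document(expression, bin_edges):
    """Encode one sample's expression values as a list of word tokens.

    Assumed encoding (not from the paper): each gene yields one token
    "<gene>_bin<k>", where k counts how many bin edges the value exceeds.
    """
    tokens = []
    for gene, value in expression.items():
        k = sum(value > edge for edge in bin_edges)
        tokens.append(f"{gene}_bin{k}")
    return tokens


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def classify(sample_topics, class_profiles):
    """Return the label whose topic profile is closest to the sample.

    class_profiles maps a label to a topic distribution, assumed here to be
    the mean topic distribution of that class's training samples.
    """
    return min(
        class_profiles,
        key=lambda label: hellinger(sample_topics, class_profiles[label]),
    )


# Illustrative values only; these are not data from the study.
document = expression_to_document(
    {"BRCA1": 0.2, "TP53": 5.0, "ESR1": 12.0}, bin_edges=[1.0, 10.0]
)
profiles = {"cancer": [0.7, 0.2, 0.1], "healthy": [0.1, 0.2, 0.7]}
label = classify([0.8, 0.1, 0.1], profiles)
```

In this sketch the tokenized documents would be the input to the topic model, and the topic distribution it infers for an unknown sample would be passed to `classify`. Other similarity measures (cosine similarity, Jensen-Shannon divergence) could be substituted without changing the structure.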
Pages: 39-44
Page count: 6