Distributed Tensor Decomposition for Large Scale Health Analytics

Cited by: 17
Authors
He, Huan [1]
Henderson, Jette [2]
Ho, Joyce C. [1]
Affiliations
[1] Emory Univ, Atlanta, GA 30322 USA
[2] CognitiveScale, Austin, TX USA
Source
WEB CONFERENCE 2019: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2019) | 2019
Funding
National Science Foundation (USA);
Keywords
Web Mining; User-Generated Content; Health Analytics; Tensor Decomposition; Distributed Algorithm; Apache Spark;
DOI
10.1145/3308558.3313548
Chinese Library Classification
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
In the past few decades, there has been rapid growth in the quantity and variety of healthcare data. These large data sets are usually high dimensional (e.g., patients, their diagnoses, and the medications used to treat those diagnoses) and cannot be adequately represented as matrices. Thus, many existing algorithms cannot analyze them. To accommodate such high-dimensional data, tensor factorization, which can be viewed as a higher-order extension of methods like PCA, has attracted much attention and emerged as a promising solution. However, tensor factorization is computationally expensive, and existing methods developed to factor large tensors are not flexible enough for real-world situations. To address this scaling problem more efficiently, we introduce SGranite, a distributed, scalable, and sparse tensor factorization method fit through stochastic gradient descent. SGranite offers three contributions: (1) Scalability: it employs a block partitioning and parallel processing design and thus scales to large tensors, (2) Accuracy: we show that our method can achieve results faster without sacrificing the quality of the tensor decomposition, and (3) Flexible Constraints: we show our approach can encompass various kinds of constraints, including ℓ2 norm, ℓ1 norm, and logistic regularization. We demonstrate SGranite's capabilities in two real-world use cases. In the first, we use Google searches for flu-like symptoms to characterize and predict influenza patterns. In the second, we use SGranite to extract clinically interesting sets (i.e., phenotypes) of patients from electronic health records. Through these case studies, we show SGranite has the potential to be used to rapidly characterize, predict, and manage large multimodal datasets, thereby promising a novel, data-driven solution that can benefit very large segments of the population.
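The core idea the abstract names, fitting a tensor factorization by stochastic gradient descent, can be illustrated with a minimal single-machine sketch. This is not SGranite itself: it omits the block partitioning, the Spark-based parallelism, and the ℓ1 and logistic constraints, and uses only a plain ℓ2 penalty on a three-way CP (CANDECOMP/PARAFAC) model with squared-error loss. The function names `predict` and `cp_sgd`, the hyperparameters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def predict(factors, coords):
    """Evaluate a three-way CP model at the given (i, j, k) coordinates."""
    A, B, C = factors
    return np.sum(A[coords[:, 0]] * B[coords[:, 1]] * C[coords[:, 2]], axis=1)

def cp_sgd(coords, values, shape, rank=2, lr=0.02, l2=1e-3, epochs=100, seed=0):
    """Fit CP factors with per-entry SGD on squared error plus an l2 penalty.

    Illustrative only: hyperparameters are assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.uniform(0.1, 1.0, size=(dim, rank)) for dim in shape]
    A, B, C = factors
    for _ in range(epochs):
        # Visit the observed entries in a fresh random order each epoch.
        for n in rng.permutation(len(values)):
            i, j, k = coords[n]
            a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
            err = np.sum(a * b * c) - values[n]
            # Gradient of 0.5 * err**2 + 0.5 * l2 * ||row||**2 for each factor row.
            A[i] = a - lr * (err * b * c + l2 * a)
            B[j] = b - lr * (err * a * c + l2 * b)
            C[k] = c - lr * (err * a * b + l2 * c)
    return factors

# Synthetic rank-2 tensor (hypothetical data, fully observed for simplicity).
data_rng = np.random.default_rng(42)
shape = (6, 5, 4)
truth = [data_rng.uniform(0.5, 1.5, size=(dim, 2)) for dim in shape]
coords = np.array(list(np.ndindex(*shape)))
values = predict(truth, coords)

init = cp_sgd(coords, values, shape, epochs=0)   # same seed, so same starting point
fit = cp_sgd(coords, values, shape, epochs=100)
mse_before = np.mean((predict(init, coords) - values) ** 2)
mse_after = np.mean((predict(fit, coords) - values) ** 2)
print(f"MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

Because each update touches only one row of each factor matrix, entries that share no index in any mode can be processed independently, which is the property that block-partitioned parallel designs typically exploit.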
Pages: 659-669
Number of pages: 11