Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes

Cited by: 1
Authors
Egleston, Brian L. [1]
Bai, Tian [2]
Bleicher, Richard J. [3]
Taylor, Stanford J. [4]
Lutz, Michael H. [4]
Vucetic, Slobodan [2]
Affiliations
[1] Temple Univ Hlth Syst, Fox Chase Canc Ctr, Biostat & Bioinformat Facil, 333 Cottman Ave, Philadelphia, PA 19111 USA
[2] Temple Univ, Dept Comp & Informat Sci, Philadelphia, PA 19122 USA
[3] Fox Chase Canc Ctr, Dept Surg Oncol, 7701 Burholme Ave, Philadelphia, PA 19111 USA
[4] Fox Chase Canc Ctr, Populat Studies Facil, 7701 Burholme Ave, Philadelphia, PA 19111 USA
Keywords
cluster-corrected standard errors; electronic health records; natural language processing; probability models; word2vec; LENGTH;
DOI
10.1111/biom.13338
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
The pointwise mutual information statistic (PMI), which measures how often two words occur together in a document corpus, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, or document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to demonstrate how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within-patient clustering, which arises from patterns of repeated words within a patient's health record due to a unique health history. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from free text notes of electronic health records. Specifically, we use our methods to distinguish those with and without type 2 diabetes mellitus in electronic health record free text data using over 400 000 clinical notes from an academic medical center.
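As a rough illustration of the statistic named in the abstract, the sketch below computes a document-level PMI using the standard textbook definition, log(p(a, b) / (p(a) p(b))), where each probability is the fraction of documents containing the word or word pair. This is an assumption made for illustration: the function name `pmi`, the whitespace tokenization, and the toy notes are all hypothetical, and the estimator used in the paper itself may differ.

```python
import math


def pmi(docs, word_a, word_b):
    """Document-level PMI: log( p(a, b) / (p(a) * p(b)) ).

    Probabilities are estimated as the fraction of documents that
    contain the word (or both words). Hypothetical helper, not the
    paper's estimator.
    """
    n = len(docs)
    # Naive tokenization: lowercase and split on whitespace.
    token_sets = [set(d.lower().split()) for d in docs]
    n_a = sum(word_a in s for s in token_sets)
    n_b = sum(word_b in s for s in token_sets)
    n_ab = sum(word_a in s and word_b in s for s in token_sets)
    if n_ab == 0:
        # The words never co-occur, so the log ratio is unbounded below.
        return float("-inf")
    return math.log((n_ab / n) / ((n_a / n) * (n_b / n)))


# Toy notes, invented for illustration only.
notes = [
    "patient reports diabetes and takes metformin",
    "metformin continued for diabetes",
    "no complaints today",
    "follow up for hypertension",
]

score = pmi(notes, "diabetes", "metformin")
```

A positive score means the two words appear together in more documents than independence would predict. The sketch treats every note as independent, so it does not address the within-patient clustering that the abstract's standard error estimates are designed to handle.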
Pages: 1089-1100
Number of pages: 12