Identifying and characterizing highly similar notes in big clinical note datasets

Cited by: 14
Authors
Gabriel, Rodney A. [1, 2]
Kuo, Tsung-Ting [1 ]
McAuley, Julian [3 ]
Hsu, Chun-Nan [1 ]
Affiliations
[1] Univ Calif San Diego, UCSD Hlth Dept Biomed Informat, 9500 Gilman Dr, La Jolla, CA 92093 USA
[2] Univ Calif San Diego, Dept Anesthesiol, 200 West Arbor Dr, San Diego, CA 92103 USA
[3] Univ Calif San Diego, Dept Comp Sci & Engn, 9500 Gilman Dr, La Jolla, CA 92093 USA
Keywords
Electronic medical record; De-duplication; Natural language processing
DOI
10.1016/j.jbi.2018.04.009
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Background: Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication.
Methods: We use an approximation algorithm to minimize pairwise comparisons consisting of three phases: (1) Minhashing with Locality Sensitive Hashing; (2) a clustering method using tree-structured disjoint sets; and (3) classification of near-duplicates (exact copies, common machine output notes, or similar notes) via pairwise comparison of notes in each cluster. We use the Jaccard Similarity (JS) to measure similarity between two documents. We analyzed two big clinical note datasets: our institutional dataset and MIMIC-III.
Results: There were 1,528,940 notes analyzed from our institution. The de-duplication algorithm completed in 36.3 h. When the JS threshold was set at 0.7, the total number of clusters was 82,371 (total notes = 304,418). Among all JS thresholds, no clusters contained pairs of notes that were incorrectly clustered. When the JS threshold was set at 0.9 or 1.0, the de-duplication algorithm captured 100% of all random pairs with their JS at least as high as the set thresholds from the validation set. Similar performance was noted when analyzing the MIMIC-III dataset.
Conclusions: We showed that among the EHR from our institution and from the publicly-available MIMIC-III dataset, there were a significant number of near-to-exact duplicated notes.
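The pipeline named in the Methods (Minhashing with Locality Sensitive Hashing, disjoint-set clustering, and Jaccard Similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-shingle size, the number of hash functions, the band count, and all function names are assumptions chosen for illustration, and the sketch checks the exact JS of each candidate pair before merging, which may differ from the order of steps in the paper. The classification of near-duplicates into exact copies, machine output notes, and similar notes is not shown.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles of a note (k=3 is an assumed choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard Similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    # Seeded 64-bit hash; each seed stands in for one random permutation.
    h = hashlib.blake2b(shingle.encode(), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def minhash_signature(shingle_set, num_hashes=100):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=20):
    """Notes sharing an identical band of their signature become candidates."""
    rows = len(signatures[0]) // bands
    buckets = {}
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs


class DisjointSet:
    """Tree-structured disjoint sets with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_notes(notes, threshold=0.7):
    """Return clusters (lists of note indices) holding more than one note."""
    sets = [shingles(n) for n in notes]
    sigs = [minhash_signature(s) for s in sets]
    ds = DisjointSet(len(notes))
    for i, j in lsh_candidate_pairs(sigs):
        # Assumption: verify each candidate with exact JS before merging.
        if jaccard(sets[i], sets[j]) >= threshold:
            ds.union(i, j)
    clusters = {}
    for i in range(len(notes)):
        clusters.setdefault(ds.find(i), []).append(i)
    return [c for c in clusters.values() if len(c) > 1]
```

Because only notes that share at least one signature band are compared, the number of exact JS computations stays far below the number of all possible pairs, which is the point of the approximation. The band and row counts trade recall against the number of candidate pairs.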
Pages: 63-69
Page count: 7
Related papers
26 in total
  • [1] Fast unfolding of communities in large networks
    Blondel, Vincent D.
    Guillaume, Jean-Loup
    Lambiotte, Renaud
    Lefebvre, Etienne
    [J]. JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2008,
  • [2] The "Meaningful Use" Regulation for Electronic Health Records
    Blumenthal, David
    Tavenner, Marilyn
    [J]. NEW ENGLAND JOURNAL OF MEDICINE, 2010, 363 (06) : 501 - 504
  • [3] Broder A., 1997, P COMPRESSION COMPLE, V21
  • [4] Min-wise independent permutations
    Broder, AZ
    Charikar, M
    Frieze, AM
    Mitzenmacher, M
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2000, 60 (03) : 630 - 659
  • [5] Delcher Chris, 2016, J Registry Manag, V43, P10
  • [6] Hammond K. W., 2003, AMIA ANN S P, V269
  • [7] Joffe Erel, 2012, AMIA Annu Symp Proc, V2012, P1269
  • [8] MIMIC-III, a freely accessible critical care database
    Johnson, Alistair E. W.
    Pollard, Tom J.
    Shen, Lu
    Lehman, Li-wei H.
    Feng, Mengling
    Ghassemi, Mohammad
    Moody, Benjamin
    Szolovits, Peter
    Celi, Leo Anthony
    Mark, Roger G.
    [J]. SCIENTIFIC DATA, 2016, 3
  • [9] A proof of the triangle inequality for the Tanimoto distance
    Lipkus, AH
    [J]. JOURNAL OF MATHEMATICAL CHEMISTRY, 1999, 26 (1-3) : 263 - 265
  • [10] Matching identifiers in electronic health records: implications for duplicate records and patient safety
    McCoy, Allison B.
    Wright, Adam
    Kahn, Michael G.
    Shapiro, Jason S.
    Bernstam, Elmer Victor
    Sittig, Dean F.
    [J]. BMJ QUALITY & SAFETY, 2013, 22 (03) : 219 - 224