Construction of Genealogical Knowledge Graphs From Obituaries: Multitask Neural Network Extraction System

Cited by: 18
Authors
He, Kai [1, 2, 3]
Yao, Lixia [4]
Zhang, JiaWei [1, 2, 3]
Li, Yufei [1, 2, 3]
Li, Chen [1, 2, 3]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xianning West Rd,27th, Xian 0086710049, Peoples R China
[2] Xi An Jiao Tong Univ, Natl Engn Lab Big Data Analyt, Xian, Peoples R China
[3] Shanxi Prov Key Lab Satellite & Terr Network Tech, Xian, Peoples R China
[4] Mayo Clin, Dept Hlth Sci Res, Rochester, MN USA
Funding
National Natural Science Foundation of China;
Keywords
genealogical knowledge graph; EHR; information extraction; genealogy; neural network;
DOI
10.2196/25670
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Abstract
Background: Genealogical information, such as that found in family trees, is essential for biomedical research, including studies of disease heritability and risk prediction. Researchers have used information on policyholders and their dependents in medical claims data, and emergency contacts in electronic health records (EHRs), to infer family relationships at a large scale. We have previously demonstrated that online obituaries can be a novel data source for building more complete and accurate family trees.
Objective: To supplement EHR data with family relationships for biomedical research, we built an end-to-end information extraction system that uses a multitask-based artificial neural network model to construct genealogical knowledge graphs (GKGs) from online obituaries. GKGs are enriched family trees with detailed information including age, gender, death and birth dates, and residence.
Methods: Building on a predefined family relationship map consisting of 4 types of entities (eg, people's name, residence, birth date, and death date) and 71 types of relationships, we curated a corpus containing 1700 online obituaries from the metropolitan area of Minneapolis and St Paul in Minnesota. We also used data augmentation to generate additional synthetic data and alleviate data scarcity for rare family relationships. A multitask-based artificial neural network model was then built to simultaneously detect names, extract relationships between them, and assign attributes (eg, birth dates and death dates, residence, age, and gender) to each individual. Finally, we assembled related GKGs into larger ones by identifying people who appear in multiple obituaries.
Results: Our system achieved a precision of 94.79%, a recall of 91.45%, and an F1 measure of 93.09% in 10-fold cross-validation. We also constructed 12,407 GKGs, the largest of which comprises 4 generations and 30 people.
Conclusions: In this work, we discussed the significance of GKGs for biomedical research, presented a new version of a corpus with a predefined family relationship map and augmented training data, and proposed a multitask deep neural system to construct and assemble GKGs. The results show that our system can extract GKGs from obituaries and demonstrate their potential for enriching EHR data for genetic research. We share the source code and system with the scientific community on GitHub; the corpus is withheld to protect privacy.
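The assembly step described in the Methods (joining GKGs that share a person across obituaries) can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary layout, the `person_key` helper, and the rule that two records with the same normalized name and birth date denote the same person are all assumptions made for illustration. The grouping itself uses a standard union-find structure.

```python
from collections import defaultdict


def person_key(person):
    """Hypothetical identity key: normalized name plus birth date (an assumption)."""
    return (person["name"].strip().lower(), person.get("birth_date"))


def assemble(gkgs):
    """Merge GKGs that share at least one person, using union-find over graphs."""
    parent = list(range(len(gkgs)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link each graph to the first graph in which any of its people was seen.
    first_seen = {}
    for idx, gkg in enumerate(gkgs):
        for person in gkg["people"]:
            key = person_key(person)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    # Collect people and relations of every graph under its group root.
    merged = defaultdict(lambda: {"people": {}, "relations": set()})
    for idx, gkg in enumerate(gkgs):
        group = merged[find(idx)]
        for person in gkg["people"]:
            group["people"].setdefault(person_key(person), person)
        group["relations"].update(gkg["relations"])
    return list(merged.values())


# Illustrative data only: two obituaries mention Bob Lee, a third is unrelated.
ann = ("ann lee", "1930-01-02")
bob = ("bob lee", "1955-03-04")
cara = ("cara lee", "1980-05-06")
obituaries = [
    {"people": [{"name": "Ann Lee", "birth_date": "1930-01-02"},
                {"name": "Bob Lee", "birth_date": "1955-03-04"}],
     "relations": [(ann, "mother_of", bob)]},
    {"people": [{"name": "Bob Lee ", "birth_date": "1955-03-04"},
                {"name": "Cara Lee", "birth_date": "1980-05-06"}],
     "relations": [(bob, "father_of", cara)]},
    {"people": [{"name": "Dan Roe", "birth_date": None}],
     "relations": []},
]
families = assemble(obituaries)
```

In this toy input the first two obituaries collapse into one three-person graph and the third stays separate. Exact matching on name and birth date is deliberately naive: people with the same name and a missing birth date would be merged incorrectly, so a real system would need a more careful identity-resolution rule.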
Pages: 15