Improving rare disease classification using imperfect knowledge graph

Cited by: 18
Authors
Li, Xuedong [1]
Wang, Yue [2]
Wang, Dongwu [3]
Yuan, Walter [3]
Peng, Dezhong [1]
Mei, Qiaozhu [4]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu, Peoples R China
[2] Univ North Carolina Chapel Hill, Sch Informat & Lib Sci, Chapel Hill, NC USA
[3] MobLab Inc, Pasadena, CA USA
[4] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA
Funding
National Science Foundation (USA);
Keywords
Rare disease diagnosis; Knowledge graph; Machine learning; Text classification; Extremely imbalanced data;
DOI
10.1186/s12911-019-0938-1
Chinese Library Classification number
R-058;
Subject classification code
Abstract
Background: Accurately recognizing rare diseases from symptom descriptions is an important task in patient triage, early risk stratification, and targeted therapies. However, because rare diseases are by nature rare, the lack of historical data poses a great challenge to machine learning-based approaches. On the other hand, medical knowledge in automatically constructed knowledge graphs (KGs) has the potential to compensate for the lack of labeled training examples. This work aims to develop a rare disease classification algorithm that makes effective use of a knowledge graph, even when the graph is imperfect. Method: We develop a text classification algorithm that represents a document as a combination of a "bag of words" and a "bag of knowledge terms," where a "knowledge term" is a term shared between the document and the subgraph of the KG relevant to the disease classification task. We use two Chinese disease diagnosis corpora to evaluate the algorithm. The first, HaoDaiFu, contains 51,374 chief complaints categorized into 805 diseases. The second, ChinaRe, contains 86,663 patient descriptions categorized into 44 disease categories. Results: On the two evaluation data sets, the proposed algorithm delivers robust performance and outperforms a wide range of baselines, including resampling, deep learning, and feature selection approaches. Both a classification-based metric (macro-averaged F1 score) and a ranking-based metric (mean reciprocal rank) are used in the evaluation. Conclusion: Medical knowledge in large-scale knowledge graphs can be effectively leveraged to improve rare disease classification models, even when the knowledge graph is incomplete.
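The representation and the ranking metric described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `("word", ...)` / `("kg", ...)` feature keys, and the example terms are assumptions made for illustration, and the abstract does not specify how the relevant KG subgraph is selected or how the two bags are weighted.

```python
from collections import Counter


def featurize(tokens, kg_terms):
    """Combine a bag of words with a bag of knowledge terms.

    tokens   -- list of tokens from one document (tokenization is assumed done)
    kg_terms -- set of terms from the task-relevant KG subgraph (hypothetical input)

    A knowledge term is a token that also occurs in kg_terms, so such a token
    is counted twice: once as a plain word and once as a knowledge term.
    """
    features = Counter()
    for tok in tokens:
        features[("word", tok)] += 1
        if tok in kg_terms:
            features[("kg", tok)] += 1
    return features


def mean_reciprocal_rank(rankings, gold_labels):
    """Average of 1/rank of the gold label; a missing gold label contributes 0."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_labels):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_labels)


# Toy example with made-up tokens and KG terms.
doc = ["fever", "rash", "fever", "tired"]
kg = {"fever", "rash"}
feats = featurize(doc, kg)

# Toy rankings: gold label is first in one list and second in the other.
mrr = mean_reciprocal_rank([["a", "b", "c"], ["b", "a", "c"]], ["a", "a"])
```

Keeping the two bags under separate keys lets a downstream linear classifier learn different weights for a term as an ordinary word and as a knowledge term; this is one possible design, not necessarily the one used in the paper.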
Pages: 10