Multi-label text mining to identify reasons for appointments to drive population health analytics at a primary care setting

Citations: 0
Authors
Abu Lekham, Laith [1, 2]
Wang, Yong [1]
Hey, Ellen [2]
Khasawneh, Mohammad T. [1]
Affiliations
[1] SUNY Binghamton, Syst Sci & Ind Engn Dept, 4400 Vestal Pkwy E, Binghamton, NY 13902 USA
[2] Finger Lakes Community Hlth, 601-B W Washington St, Geneva, NY 14456 USA
Keywords
Community Health; Machine Learning; Multi-Label Classification; Population Health Analysis; Primary Care; Text Mining; DISEASE;
DOI
10.1007/s00521-022-07306-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While much research has been conducted on population health analytics (PHA), little of it involves text mining. In this research, a novel multi-label text mining model is developed to analyze and categorize the reasons for medical appointments at a primary medical center serving a rural population. The model converts an unstructured, unlabeled text corpus into a structured, labeled multi-label text corpus by using look-up wordlists defined through expert domain knowledge (EDK). The text dataset contains the reasons patients gave when making appointments in 2019. The appointment reasons were grouped into 27 categories. Each appointment reason text is tagged with its associated group (label) using the corresponding look-up wordlists. The tagged corpus is then used to develop a multi-label text classification model using machine learning algorithms. Two resampling models (balanced classifiers and SMOTE) are considered to adjust for the imbalance in the created labels. The classifiers and models are tested in three steps using validation, testing, and implementation datasets. Both models performed well, but the SMOTE model is more generalizable, reliable, and consistent than the balanced classifiers. The label-set performance measures are equal to or greater than 77.9% for the balanced classifiers and greater than 80% for the SMOTE model. The label-based testing performance measures using both models and all classifiers are generally greater than 90% for all labels. Finally, the PHA showed that follow-up patients and well-check (WC) physical patients are the largest populations (32% and 16.54%, respectively). In addition, the populations differ on factors such as age, insurance, show rate, punctuality rate, and scheduling type, while they are very similar on other factors such as ethnicity and gender.
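The wordlist-tagging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three labels, the terms in `WORDLISTS`, and the function names `tag_reason` and `to_indicator_rows` are hypothetical stand-ins, since the record does not list the paper's 27 expert-defined wordlists. The sketch only shows how free-text appointment reasons could be turned into a multi-label indicator matrix that a classifier can then be trained on.

```python
import re

# Hypothetical look-up wordlists (assumption): the paper's 27
# expert-defined lists are not reproduced in this record.
WORDLISTS = {
    "follow_up": ["follow up", "f/u", "recheck"],
    "well_check": ["physical", "well check", "wc"],
    "diabetes": ["diabetes", "a1c", "blood sugar"],
}


def tag_reason(text, wordlists=WORDLISTS):
    """Return the sorted labels whose wordlist has a term in the text."""
    lowered = text.lower()
    labels = []
    for label, terms in wordlists.items():
        for term in terms:
            # Require non-word characters (or string ends) around the
            # term so short terms like "wc" do not match inside words.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            if re.search(pattern, lowered):
                labels.append(label)
                break  # one matching term is enough for this label
    return sorted(labels)


def to_indicator_rows(texts, wordlists=WORDLISTS):
    """Build a 0/1 row per text, with columns in sorted label order."""
    label_order = sorted(wordlists)
    rows = []
    for text in texts:
        tagged = set(tag_reason(text, wordlists))
        rows.append([1 if label in tagged else 0 for label in label_order])
    return label_order, rows
```

A single reason can receive several labels, which is what makes the resulting corpus multi-label: for example, `tag_reason("F/U diabetes, check A1C")` matches terms from both the `follow_up` and `diabetes` lists. The resampling and classification stages mentioned in the abstract would operate on the indicator rows and are not sketched here.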
Pages: 14971 - 15005
Page count: 35
Related papers
45 in total
  • [1] A Multi-Stage predictive model for missed appointments at outpatient primary care settings serving rural areas
    Abu Lekham, Laith
    Wang, Yong
    Hey, Ellen
    Lam, Sarah S.
    Khasawneh, Mohammad T.
    [J]. IISE TRANSACTIONS ON HEALTHCARE SYSTEMS ENGINEERING, 2021, 11 (02) : 79 - 94
  • [2] Multi-criteria text mining model for COVID-19 testing reasons and symptoms and temporal predictive model for COVID-19 test results in rural communities
    Abu Lekham, Laith
    Wang, Yong
    Hey, Ellen
    Khasawneh, Mohammad T.
    [J]. NEURAL COMPUTING & APPLICATIONS, 2022, 34 (10) : 7523 - 7536
  • [3] Agrawal R., 1994, PROC 20 INT C VERY L, V1215, P487, DOI 10.5555/645920.672836
  • [4] Cardiology's problem women
    [Anonymous]
    [J]. LANCET, 2019, 393 (10175) : 959 - 959
  • [5] [Anonymous], 2017, ARXIV170201460
  • [6] Bird S., 2009, NATURAL LANGUAGE PRO
  • [7] Brodersen Kay H., 2010, Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR 2010), P3121, DOI 10.1109/ICPR.2010.764
  • [8] Addressing imbalance in multilabel classification: Measures and random resampling algorithms
    Charte, Francisco
    Rivera, Antonio J.
    del Jesus, Maria J.
    Herrera, Francisco
    [J]. NEUROCOMPUTING, 2015, 163 : 3 - 16
  • [9] SMOTE: Synthetic minority over-sampling technique
    Chawla, Nitesh V.
    Bowyer, Kevin W.
    Hall, Lawrence O.
    Kegelmeyer, W. Philip
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2002, 16 : 321 - 357
  • [10] Davis J., 2006, PROC 23 INT C MACH L, P233, DOI 10.1145/1143844.1143874