Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach
Cited by: 20
Authors:
Bej, Saptarshi [1,2]
Sarkar, Jit [3,4]
Biswas, Saikat [5]
Mitra, Pabitra [6]
Chakrabarti, Partha [3,4]
Wolkenhauer, Olaf [1,2,7]
Affiliations:
[1] Univ Rostock, Dept Syst Biol & Bioinformat, Rostock, Germany
[2] Tech Univ Munich, Leibniz Inst Food Syst Biol, Munich, Germany
[3] CSIR Indian Inst Chem Biol, Div Cell Biol & Physiol, Kolkata, India
[4] Acad Innovat & Sci Res, Ghaziabad, India
[5] Indian Inst Technol, Adv Technol Dev Ctr, Kharagpur, W Bengal, India
[6] Indian Inst Technol, Dept Comp Sci & Engn, Kharagpur, W Bengal, India
[7] Stellenbosch Univ, Stellenbosch Inst Adv Study STIAS, Wallenberg Res Ctr, Stellenbosch, South Africa
Keywords:
SOCIOECONOMIC POSITION;
FOOD GROUPS;
FOLLOW-UP;
MELLITUS;
RISK;
ASSOCIATION;
MEN;
DOI:
10.1038/s41387-022-00206-2
Chinese Library Classification (CLC) number:
R5 [Internal Medicine];
Subject classification code:
1002 ;
100201 ;
Abstract:
Background: Studies on Type-2 Diabetes Mellitus (T2DM) have revealed sub-populations that are heterogeneous in their underlying pathologies. However, the identification of such sub-populations in epidemiological datasets remains unexplored. Here we focus on detecting T2DM clusters in epidemiological data, specifically analysing the National Family Health Survey-4 (NFHS-4) dataset from India, which contains a wide spectrum of features, including medical history, dietary and addiction habits, and socio-economic and lifestyle patterns of 10,125 T2DM patients.
Methods: Epidemiological data are challenging to analyse because of the diverse types of features they contain. For this reason, applying the state-of-the-art dimension reduction tool UMAP in the conventional way was found to be ineffective for the NFHS-4 dataset. We implemented a distributed clustering workflow that combines different similarity measure settings of UMAP to cluster continuous, ordinal and nominal features separately. We then integrated the reduced dimensions from each feature-type-distributed clustering to obtain an interpretable and unbiased clustering of the data.
Results: Our analysis reveals four significant clusters, two of which consist mainly of non-obese T2DM patients. These non-obese clusters have a lower mean age and consist largely of rural residents. Surprisingly, in one of the obese clusters 90% of the T2DM patients practised a non-vegetarian diet, though they did not show an increased intake of plant-based protein-rich foods.
Conclusions: From a methodological perspective, we show that for the diverse data types that are frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is effective, in contrast to the conventional use of the UMAP algorithm. The application of a UMAP-based clustering workflow to this type of dataset is novel in itself. Our findings demonstrate the presence of heterogeneity among Indian T2DM patients with regard to socio-demography and dietary patterns. From our analysis, we conclude that the existence of significant non-obese T2DM sub-populations, characterized by younger age groups and economic disadvantage, raises the need for different screening criteria for T2DM among rural Indian residents.
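The feature-type split described in the Methods can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and it does not run UMAP. It only shows columns being grouped by type and each group receiving its own dissimilarity measure. The toy records, the column names, and the specific metric choices (Euclidean, Manhattan, Hamming) are all assumptions made for illustration, since the abstract does not state which measures were used.

```python
from math import sqrt

# Hypothetical toy records; the column names are illustrative, not NFHS-4 fields.
records = [
    {"age": 34.0, "bmi": 21.5, "wealth_index": 1, "meat_freq": 2,
     "residence": "rural", "smokes": "no"},
    {"age": 58.0, "bmi": 31.0, "wealth_index": 4, "meat_freq": 3,
     "residence": "urban", "smokes": "no"},
    {"age": 36.0, "bmi": 22.0, "wealth_index": 1, "meat_freq": 2,
     "residence": "rural", "smokes": "yes"},
]

# Assumed grouping of columns by feature type.
FEATURE_TYPES = {
    "continuous": ["age", "bmi"],
    "ordinal": ["wealth_index", "meat_freq"],
    "nominal": ["residence", "smokes"],
}


def euclidean(a, b):
    """Straight-line distance; real data would need scaling first."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute rank differences for ordered categories."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Fraction of positions whose unordered categories differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Assumed metric per feature type (an illustration, not the paper's settings).
METRICS = {"continuous": euclidean, "ordinal": manhattan, "nominal": hamming}


def split_by_type(rows):
    """Return one list of feature vectors per feature type."""
    return {
        ftype: [[row[col] for col in cols] for row in rows]
        for ftype, cols in FEATURE_TYPES.items()
    }


def pairwise(block, metric):
    """Full pairwise dissimilarity matrix for one feature-type block."""
    n = len(block)
    return [[metric(block[i], block[j]) for j in range(n)] for i in range(n)]


blocks = split_by_type(records)
distances = {
    ftype: pairwise(block, METRICS[ftype]) for ftype, block in blocks.items()
}

for ftype, matrix in distances.items():
    print(ftype, matrix[0])
```

In a full workflow along the lines of the abstract, each block would be embedded separately and the resulting low-dimensional coordinates joined before clustering. With the umap-learn package, the `metric` argument of `umap.UMAP` is the natural place to make such a per-type choice.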
Pages: 11