Verification of De-Identification Techniques for Personal Information Using Tree-Based Methods with Shapley Values

Cited by: 12
Authors
Lee, Junhak [1]
Jeong, Jinwoo [1]
Jung, Sungji [1]
Moon, Jihoon [1]
Rho, Seungmin [1]
Affiliation
[1] Chung Ang Univ, Dept Ind Secur, Seoul 06974, South Korea
Funding
National Research Foundation of Singapore;
Keywords
de-identification; medical data; machine learning; tree-based method; explainable artificial intelligence; artificial intelligence; big data; privacy protection; neural network; usability; age
DOI
10.3390/jpm12020190
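Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Discipline classification code
Abstract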
With the development of big data and cloud computing technologies, the importance of pseudonymized information has grown. However, tools for verifying whether a de-identification methodology has been applied correctly, so that both data confidentiality and usability are preserved, remain insufficient. This paper proposes a method for verifying de-identification techniques for personal healthcare information that considers both data confidentiality and usability. Data are generated and preprocessed with reference to actual statistical data, personal information datasets, and de-identification datasets based on medical data, so that each de-identification technique is represented as a numeric dataset. Five tree-based regression models (decision tree, random forest, gradient boosting machine, extreme gradient boosting, and light gradient boosting machine) are constructed on the de-identification dataset to capture nonlinear relationships between the dependent and independent variables. The most effective model is then selected based on personal information data for which pseudonym processing is essential for data utilization. Finally, Shapley additive explanations (SHAP), an explainable artificial intelligence technique, are applied to the most effective model to support the establishment of pseudonym processing policies and to present a machine-learning process that selects an appropriate de-identification methodology.
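To make the attribution step in the abstract concrete, the following is a minimal sketch of how Shapley values split a single prediction across input features. It is not the authors' implementation: the paper applies SHAP to fitted tree ensembles, whereas this sketch enumerates all feature orderings by brute force, which is only feasible for a handful of features. The model `toy_model`, its three feature names, the risk scores it returns, and the all-zero baseline are hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are set to their baseline value
    (an assumption of this sketch). Cost grows as n!, so this is
    only practical for very small n.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        z = list(baseline)          # start from the baseline instance
        prev = predict(z)
        for i in order:
            z[i] = x[i]             # add feature i to the coalition
            cur = predict(z)
            phi[i] += cur - prev    # marginal contribution of feature i
            prev = cur
    n_orders = factorial(n)
    return [p / n_orders for p in phi]


def toy_model(z):
    """Hypothetical stand-in for a small regression tree.

    Returns an illustrative risk score from three made-up settings;
    noise_scale is deliberately never used by the tree.
    """
    generalization_level, masked_digits, noise_scale = z
    if generalization_level >= 3:
        return 0.2 if masked_digits >= 2 else 0.5
    return 0.8


x = [4, 3, 1.0]          # instance to explain (hypothetical values)
baseline = [0, 0, 0.0]   # reference instance (hypothetical values)
phi = shapley_values(toy_model, x, baseline)
```

Two properties are worth checking on the result: the attributions sum to `toy_model(x) - toy_model(baseline)`, and the unused `noise_scale` feature receives an attribution of zero.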
Pages: 19