An Advanced Semantic Feature-Based Cross-Domain PII Detection, De-Identification, and Re-Identification Model Using Ensemble Learning

Authors
Kulkarni, Poornima [1]
Cauvery, N. K. [1]
Hemavathy, R. [2]
Affiliations
[1] RV Coll Engn, Dept ISE, Bengaluru, India
[2] RV Coll Engn, Dept CSE, Bengaluru, India
Keywords
PII Detection; machine learning; natural language processing; artificial intelligence; de-identification; PERSONALLY IDENTIFIABLE INFORMATION; PRIVACY; PROTECTION; MACHINE
DOI
10.14569/IJACSA.2024.0151277
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Digital data is core to any system and must be communicated across peers and human-machine interfaces; however, ensuring data security and privacy remains a challenge for industry, especially under the threat of man-in-the-middle attacks, intruders, and ill-intended unauthorized access at data warehouses. Almost all digital communication carries personally identifiable information (PII) such as an individual's address, contact details, and identification credentials. Unauthorized or ill-intended access to these PII attributes can cause major losses to the individual, so it is essential to identify and de-identify PII elements across digital platforms to preserve privacy. Unfortunately, the diversity of PII attributes across disciplines makes it difficult for state-of-the-art methods to detect PII with a predefined dictionary. A model developed for a specific PII type is not universally viable for other disciplines, and applying multiple dictionaries for different disciplines makes a solution more resource-intensive. To alleviate these challenges, this paper proposes a robust ensemble-of-ensemble learning assisted, semantic feature driven, cross-discipline PII detection and de-identification model (EESD-PII). A large set of text queries encompassing diverse PII attributes, including personal credentials, healthcare data, and finance attributes, was used for training-based PII detection and classification. The input texts underwent preprocessing comprising stop-word removal, punctuation removal, website-link removal, lower-case conversion, lemmatization, and tokenization. The tokenized text was then embedded using Word2Vec continuous bag-of-words (CBOW), which not only provided a latent feature space for analytics but also enabled de-identification to preserve security.
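The preprocessing steps named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the URL pattern, and the function name `preprocess` are hypothetical, the step order is an assumption (links are stripped before punctuation so that URLs are removed whole), and lemmatization is omitted because it needs an external NLP library.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "at", "on", "in", "and", "or", "to", "of", "my"}

# Assumed pattern for website links: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text: str) -> List[str]:
    """Lower-case, strip links and punctuation, tokenize, and drop stop words."""
    text = text.lower()                  # lower-case conversion
    text = URL_PATTERN.sub(" ", text)    # website-link removal
    text = text.translate(PUNCT_TABLE)   # punctuation removal
    tokens = text.split()                # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]  # stop-word removal
```

For example, `preprocess("My email is at https://example.com, call 555-1234!")` yields `['email', 'call', '555', '1234']`. The resulting token lists are the kind of input a Word2Vec CBOW model would be trained on.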
To address class imbalance, synthetic minority over-sampling techniques such as SMOTE, SMOTE-BL, and SMOTEENN were applied. The resampled features were then subjected to feature selection using the Wilcoxon Rank Sum Test (WRST), which retained the most significant features at the 95% confidence level. The selected features were Min-Max normalized to alleviate over-fitting and convergence problems, and the normalized feature vector was classified using an ensemble-of-ensemble learning model with Bagging, Boosting, AdaBoost, Random Forest, and Extra Tree Classifier as base classifiers. The proposed model performs consensus-based majority voting to annotate each text query as PII or Non-PII. A positively annotated query can later be processed with dictionary-based PII attribute masking to achieve de-identification. The semantic embedding thus supports NLP-based PII detection, de-identification, and re-identification tasks. Simulation results show that the proposed EESD-PII model achieves a PII annotation accuracy of 99.77%, precision of 99.81%, recall of 99.63%, and F-measure of 99.71%.
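Two of the steps described above, Min-Max normalization and consensus-based majority voting, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the paper's code: the function names are hypothetical, labels are assumed to be 1 for PII and 0 for Non-PII, and a strict majority is assumed to be required for a PII label (the abstract does not say how ties are handled).

```python
from typing import List, Sequence


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: map every value to 0.0 to avoid division by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine base-classifier outputs by majority vote.

    predictions[i][j] is classifier i's label for sample j
    (assumed encoding: 1 = PII, 0 = Non-PII).
    """
    n_classifiers = len(predictions)
    n_samples = len(predictions[0])
    voted = []
    for j in range(n_samples):
        pii_votes = sum(clf_preds[j] for clf_preds in predictions)
        # Assumption: a strict majority is needed to annotate a query as PII.
        voted.append(1 if 2 * pii_votes > n_classifiers else 0)
    return voted
```

With five base classifiers, as listed in the abstract, a binary vote cannot tie. For instance, `majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]])` returns `[1, 0, 1]`.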
Pages: 763-779
Page count: 17