A novel framework for Chinese personal sensitive information detection

Cited by: 1

Authors:

Ren, Chenglong [1]

Lan, Xiao [2,4]

Chen, Xingshu [1,2]

Luo, Yonggang [2]

Ruan, Shuhua [1,2,3]

Affiliations:

[1] Sichuan Univ, Sch Cyber Sci & Engn, Chengdu, Peoples R China

[2] Sichuan Univ, Cyber Sci Res Inst, Chengdu, Peoples R China

[3] Sichuan Univ, Sch Cyber Sci & Engn, Chengdu 610000, Peoples R China

[4] Sichuan Univ, Cyber Sci Res Inst, Chengdu 610000, Peoples R China

Source:

CONNECTION SCIENCE | 2024 / Volume 36 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Chinese; personal sensitive information; rule matching; sequence labeling; context analysis; MODEL;

DOI:

10.1080/09540091.2023.2298310

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the rapid development of social networks, the harm caused by the leakage of personal sensitive information is becoming increasingly serious. To detect and identify personal sensitive information, existing methods build matching rules to detect specific sensitive entities and use machine learning methods to classify sensitive text. These methods face challenges in analysing context and in adapting to the characteristics of the Chinese language. This paper proposes CPSID, a method for detecting Chinese personal sensitive information. On the one hand, CPSID uses rule matching to detect specific personal sensitive information that contains only letters and numbers. On the other hand, and more importantly, CPSID constructs a sequence labelling model named EBC (ELECTRA-BiLSTM-CRF) to detect more complex personal sensitive information that consists of Chinese characters. The EBC model uses ELECTRA to produce word embeddings and uses BiLSTM and CRF models to extract personal sensitive information, so it can detect Chinese sensitive entities accurately by analysing context information. The model achieves an F1 score of 94.09% on Chinese datasets, outperforming other similar models. Additionally, experiments on real data show that CPSID achieves better detection results than either individual method (rule matching or sequence labelling) alone.
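
The rule-matching side of such a method can be illustrated with a minimal sketch. The abstract does not list the actual rules CPSID uses, so the three patterns below (mainland China mobile number, 18-character resident ID number, email address), the `RULES` table, and the `match_rules` function are all assumptions chosen only because they are entity types made up of letters and numbers. The ID pattern checks the shape only and does not validate the checksum.

```python
import re

# Hypothetical rule set: these patterns are illustrative assumptions,
# not the rules used by CPSID.
RULES = {
    # 11 digits, starting with 1 and a second digit in 3-9,
    # not embedded in a longer digit run.
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)", re.ASCII),
    # 17 digits followed by a digit or X (shape only, no checksum test).
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)", re.ASCII),
    # Simplified email shape; real-world addresses can be more varied.
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def match_rules(text):
    """Return (label, matched_text, start, end) tuples sorted by start offset."""
    hits = []
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])


if __name__ == "__main__":
    sample = "联系电话13812345678，邮箱zhang.san@example.com"
    for hit in match_rules(sample):
        print(hit)
```

Rules of this kind are cheap and precise for fixed-format strings, but they cannot decide whether a run of Chinese characters is a name or an address. That is the gap the abstract assigns to the context-aware EBC sequence labelling model.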


Pages: 23

