ContextMiner: Mining Contextual Features for Conceptualizing Knowledge in Security Texts

Cited by: 2
Authors
Gutierrez, Luis Felipe [1]
Namin, Akbar [1]
Affiliations
[1] Texas Tech Univ, Dept Comp Sci, Lubbock, TX 79409 USA
Funding
U.S. National Science Foundation
Keywords
Feature extraction; Computer security; Data mining; Syntactics; Natural language processing; Machine learning; Tagging; Dependency parsing; Word embeddings
DOI
10.1109/ACCESS.2022.3198944
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents ContextMiner, a novel natural language processing (NLP) framework that automatically captures contextual features in order to extract meaningful, context-aware phrases from unstructured cybersecurity text. The framework uses basic attributes such as part-of-speech tagging, dependency parsing, and a domain-specific grammar to extract the contextual features. The effectiveness and applications of ContextMiner are evaluated from two perspectives: qualitative and quantitative. In the qualitative analysis, our case studies show that the proposed framework is capable of retrieving additional content from the given texts, in both labeled and unlabeled settings, and thus of building context-aware phrases, in comparison with existing approaches. In the quantitative analysis, we evaluate ContextMiner as a pre-processing step for named entity recognition (NER). Our results show that ContextMiner reduces the corpus by up to 70% while retaining 85% of its relevant entities, with a small drop in the classification metrics. Finally, we explore the use of ContextMiner in the construction of, and reasoning over, knowledge graphs.
Pages: 85891-85904
Page count: 14