ContextMiner: Mining Contextual Features for Conceptualizing Knowledge in Security Texts

Cited by: 1
Authors
Gutierrez, Luis Felipe [1]
Namin, Akbar [1]
Affiliations
[1] Texas Tech Univ, Dept Comp Sci, Lubbock, TX 79409 USA
Funding
National Science Foundation (USA)
Keywords
Feature extraction; Computer security; Data mining; Syntactics; Natural language processing; Machine learning; Tagging; Dependency parsing; Word embeddings
DOI
10.1109/ACCESS.2022.3198944
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents ContextMiner, a novel natural language processing (NLP) framework that automatically captures contextual features in order to extract meaningful, context-aware phrases from unstructured cybersecurity text. The framework uses basic attributes such as part-of-speech tags, dependency parses, and a domain-specific grammar to extract the contextual features. The effectiveness and applications of ContextMiner are evaluated from two perspectives: qualitative and quantitative. Qualitatively, our case studies show that, compared with existing approaches, the proposed framework retrieves additional content from the given texts, in both labeled and unlabeled settings, and thus builds context-aware phrases. Quantitatively, we evaluate ContextMiner as a pre-processing step for named entity recognition (NER). Our results show that ContextMiner reduces the corpus by up to 70% while retaining 85% of its relevant entities, with a small drop in the classification metrics. Finally, we explore the use of ContextMiner in the construction of, and reasoning over, knowledge graphs.
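To make the idea of grammar-driven phrase extraction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hand-tagged toy sentence, a hypothetical one-line grammar rule (optional adjectives followed by one or more nouns), and a simple token-based reduction measure; it omits dependency parsing entirely, and the names `TAGGED`, `PHRASE_RULE`, `extract_phrases`, and `reduction` are illustrative only.

```python
import re

# Hand-tagged toy sentence: (token, coarse POS tag). In practice the tags
# would come from a tagger; they are hard-coded here (an assumption) to keep
# the sketch self-contained.
TAGGED = [
    ("The", "DET"), ("attacker", "NOUN"), ("exploited", "VERB"),
    ("a", "DET"), ("remote", "ADJ"), ("code", "NOUN"),
    ("execution", "NOUN"), ("vulnerability", "NOUN"),
    ("in", "ADP"), ("Apache", "PROPN"), ("Struts", "PROPN"), (".", "PUNCT"),
]

# Hypothetical grammar rule: zero or more adjectives, then one or more nouns.
# Each tag is mapped to a single character so the rule can run as a regex.
TAG_CODES = {"ADJ": "J", "NOUN": "N", "PROPN": "N"}
PHRASE_RULE = re.compile(r"J*N+")


def extract_phrases(tagged):
    """Return the token spans whose tag sequence matches PHRASE_RULE."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in PHRASE_RULE.finditer(codes)
    ]


def reduction(tagged, phrases):
    """Fraction of tokens discarded when only the extracted phrases are kept."""
    kept = sum(len(phrase.split()) for phrase in phrases)
    return 1 - kept / len(tagged)


phrases = extract_phrases(TAGGED)
print(phrases)
print(f"corpus reduction: {reduction(TAGGED, phrases):.0%}")
```

On this toy sentence the rule keeps 7 of 12 tokens, so the reduction figure is purely illustrative and unrelated to the 70% reported in the abstract.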
Pages: 85891-85904
Page count: 14