Knowledge Graph Generation and Application for Unstructured Data Using Data Processing Pipeline

Cited by: 0

Authors:

Thushara Sukumar, Sushmi [1]

Lung, Chung-Horng [1]

Zaman, Marzia [2]

Panday, Ritesh [1]

Affiliations:

[1] Carleton Univ, Dept Syst & Comp Engn, Ottawa K1S 5B6, ON, Canada

[2] Cistel Technol Res & Dev, Ottawa, ON K2E 7V7, Canada

Source:

IEEE ACCESS | 2024 / Vol. 12

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

Data mining; Data processing; Natural language processing; Knowledge graphs; Machine learning; Named entity recognition; Buffer storage; Coreference resolution; graph database; knowledge graph; machine learning; named entity linking; natural language processing; Neo4j; relationship extraction; unstructured data;

DOI:

10.1109/ACCESS.2024.3462635

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812;

Abstract:

With the rapid advancement of technology and the vast volume of unstructured data available on the Internet, there is a pressing need to extract information effectively from diverse data formats; otherwise, valuable information may be lost. To address this issue, researchers are using Machine Learning (ML) and Natural Language Processing (NLP) techniques, including Knowledge Graphs (KGs), to extract information from unstructured text. This paper presents end-to-end experimental studies of KG construction from unstructured text using open-source techniques and concrete real-world examples in different problem domains. The unstructured data passes through a text processing pipeline consisting of coreference resolution, named entity linking, and relationship extraction. The pipeline is designed to automatically store the extracted entities and their relationships in the Neo4j graph database. Experiments were conducted on a real-world unstructured BBC News Dataset to analyze the output of the pipeline. The experience can help practitioners adopt KG creation to capture valuable information from large volumes of unstructured text. The relationship extraction step was evaluated with two techniques in terms of extracted entities, relationship types, accuracy (61.4% with OpenNRE and 87% with REBEL), and processing time. Further, the data processing pipeline was applied to the unstructured dataset of the Transportation Safety Board's (TSB) Findings for aviation safety analysis. The results showed that the structured relationships identified by the pipeline provided valuable indicators, capturing critical aviation safety information such as the flight, aircraft type, and event. The pipeline can be fine-tuned with a domain-specific knowledge base to provide higher accuracy and better entity detection.
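The final step described in the abstract, storing extracted entities and relationships in Neo4j, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Triple` type, the `Entity` node label, the `relation_type` and `deduplicate` helpers, and the sample triples are all hypothetical, and the relationship extractor (OpenNRE or REBEL in the paper) is assumed to have already produced (head, relation, tail) triples. The code only builds parameterized Cypher `MERGE` statements and does not connect to a database.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Triple(NamedTuple):
    """Hypothetical container for one extracted (head, relation, tail) fact."""
    head: str
    relation: str
    tail: str


def relation_type(relation: str) -> str:
    """Turn a free-text relation label into a Cypher-safe relationship type.

    Relationship types cannot be passed as query parameters in Cypher,
    so the label is restricted to letters, digits, and underscores.
    """
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", relation).strip("_").upper()
    if not cleaned:
        return "RELATED_TO"  # assumed fallback for empty labels
    if cleaned[0].isdigit():
        cleaned = "REL_" + cleaned
    return cleaned


def deduplicate(triples: List[Triple]) -> List[Triple]:
    """Drop triples that differ only in case or surrounding whitespace."""
    seen = set()
    unique = []
    for t in triples:
        key = (t.head.strip().lower(), relation_type(t.relation), t.tail.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def to_cypher(triple: Triple) -> Tuple[str, Dict[str, str]]:
    """Build an idempotent Cypher statement plus its parameters.

    MERGE creates each node and relationship only if it does not exist yet,
    so re-running the pipeline on the same text does not duplicate data.
    """
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation_type(triple.relation)}]->(t)"
    )
    return query, {"head": triple.head, "tail": triple.tail}


if __name__ == "__main__":
    # Made-up example triples, not taken from the paper's datasets.
    sample = [
        Triple("Example Airline", "operator of", "Example Aircraft"),
        Triple("example airline", "Operator of", "example aircraft"),
    ]
    for triple in deduplicate(sample):
        query, params = to_cypher(triple)
        print(query, params)
```

Each `(query, params)` pair could then be passed to a Neo4j driver session, for example `session.run(query, params)` in the official Python driver. Keeping entity names as parameters rather than interpolating them into the query string avoids quoting problems with names that contain apostrophes or other special characters.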


Pages: 136759-136770

Page count: 12

