An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX

Cited by: 1

Authors:

Mohammed, Wria Mohammed Salih [1,2]

Maa, Alaa Khalil Ju [1]

Affiliations:

[1] Sulaimani Polytech Univ, Tech Coll Informat, Sulaimani 46001, Kurdistan Regio, Iraq

[2] Univ Sulaimani, Coll Base Educ, St 1-Zone 501, Sulaimani, Kurdistan Regio, Iraq

Source:

ADCAIJ-ADVANCES IN DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE JOURNAL | 2024 / Vol. 13

Keywords:

Hadoop; Semantic web; GraphX; Linked data; SPARQL; HDFS; RDF; Spark;

DOI:

10.14201/adcaij.31506

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Data volumes are growing at an astonishing rate, and traditional techniques for storing and processing data, such as relational and centralized databases, have become inefficient and time-consuming. Linked data and the Semantic Web make internet data machine-readable. As the volume of linked data and Semantic Web data grows, traditional storage and processing approaches are no longer sufficient and quickly exhaust limited hardware resources. Storing datasets with distributed, clustered methods is therefore essential. Hadoop can store large datasets because it distributes data across many hard disks in a cluster; Apache Spark can process data in parallel more efficiently than Hadoop MapReduce because Spark works in memory rather than on disk. In this paper, Semantic Web data is stored and processed using Apache Spark GraphX and the Hadoop Distributed File System (HDFS). Spark's in-memory processing and distributed computing enable efficient analysis of massive datasets stored in HDFS, and Spark GraphX allows graph-based processing of Semantic Web data. The fundamental objective of this work is to provide a way to combine Semantic Web and big data technologies efficiently, so that their combined strengths can be used in data analysis and processing. First, the proposed approach uses the SPARQL query language to extract Semantic Web data from DBpedia datasets. DBpedia is a large, publicly available Semantic Web dataset built on Wikipedia. Second, the extracted Semantic Web data is converted to the GraphX data format, generating vertex and edge files. The conversion process is implemented using Apache Spark GraphX. Third, both the vertex and edge tables are stored in HDFS, where they are available for visualization and analysis operations.
Furthermore, the proposed technique improves storage efficiency by reducing the required storage space by nearly half when Semantic Web data is converted to GraphX files: the RDF data is around 133.8 in size, while the GraphX files are around 75.3. The parallel data processing provided by Apache Spark also reduces the time required for data processing and analysis. This article concludes that Apache Spark GraphX can enhance Semantic Web and Big Data technologies. Converting Semantic Web data to GraphX format minimizes data size and processing time, enabling efficient data management and seamless integration.
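
The conversion step described in the abstract (RDF triples turned into vertex and edge tables) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than Spark GraphX, the function name `triples_to_graph_tables` is hypothetical, and the sample triples are made up for illustration rather than taken from DBpedia. It only shows the general idea of assigning each distinct subject or object a numeric vertex id and recording each predicate as a labeled edge.

```python
from typing import Dict, List, Tuple

# A triple is (subject, predicate, object), as in RDF.
Triple = Tuple[str, str, str]


def triples_to_graph_tables(
    triples: List[Triple],
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, int, str]]]:
    """Build a vertex table and an edge table from a list of triples.

    Hypothetical helper for illustration only: each distinct subject or
    object becomes one vertex row (id, term); each triple becomes one
    edge row (source id, destination id, predicate).
    """
    vertex_ids: Dict[str, int] = {}
    edges: List[Tuple[int, int, str]] = []

    def vertex_id(term: str) -> int:
        # Ids are assigned in first-seen order, starting at 1.
        if term not in vertex_ids:
            vertex_ids[term] = len(vertex_ids) + 1
        return vertex_ids[term]

    for subject, predicate, obj in triples:
        src = vertex_id(subject)
        dst = vertex_id(obj)
        edges.append((src, dst, predicate))

    vertices = [(vid, term) for term, vid in vertex_ids.items()]
    return vertices, edges


# Made-up sample triples (assumption: not real query results).
sample_triples: List[Triple] = [
    ("ex:Spark", "ex:license", "ex:ApacheLicense"),
    ("ex:Hadoop", "ex:license", "ex:ApacheLicense"),
    ("ex:Spark", "ex:writtenIn", "ex:Scala"),
]

vertices, edges = triples_to_graph_tables(sample_triples)
for row in vertices:
    print("vertex", row)
for row in edges:
    print("edge", row)
```

In a Spark setting, the two lists would roughly correspond to the vertex and edge collections from which a GraphX graph is built, and each could be written to HDFS as its own file; how the paper actually encodes and stores them is not specified in this abstract.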


Pages: 20

