Cichlid: Efficient Large Scale RDFS/OWL Reasoning with Spark

Cited by: 25

Authors:

Gu, Rong [1]

Wang, Shanyong [1]

Wang, Fangfang [1]

Yuan, Chunfeng [1]

Huang, Yihua [1]

Affiliations:

[1] Nanjing Univ, Collaborat Innovat Ctr Novel Software Technol & I, Natl Key Lab Novel Software Technol, Nanjing 210093, Jiangsu, Peoples R China

Source:

2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS) | 2015

Keywords:

semantic reasoning; parallel reasoning; RDFS; OWL; in-memory computing

DOI:

10.1109/IPDPS.2015.14

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Subject classification code:

0812

Abstract:

In the era of big data, the volume of semantic data is growing rapidly. Large-scale semantic data contains a great deal of significant but often implicit information that must be derived by reasoning. Reasoning over semantic data is challenging. On the one hand, traditional single-node reasoning systems can hardly cope with such large amounts of data because of resource limitations. On the other hand, existing large-scale reasoning systems are not very efficient or scalable because of the complexity of the reasoning process. In this paper, we propose Cichlid, an efficient distributed reasoning engine for the widely used RDFS and OWL Horst rule sets. Cichlid is built on top of Spark and implements parallel reasoning algorithms with the Spark RDD programming model. We optimize the parallel RDFS reasoning algorithm in three respects: the data partition model, the execution order of the reasoning rules, and the removal of duplicated data. For the parallel OWL reasoning process, we optimize the most time-consuming parts: large-scale data joins, the transitive closure computation, and the equivalence relation computation. In addition to these optimizations at the reasoning-algorithm level, we also optimize Spark's internal execution mechanism by proposing an off-heap memory storage mechanism for RDDs. This system-level optimization patch has been accepted and integrated into Apache Spark 1.0. Experimental results show that Cichlid is around 10 times faster on average than state-of-the-art distributed reasoning systems on both large-scale synthetic and real-world benchmarks. The proposed reasoning algorithms and engine also achieve excellent scalability and fault tolerance.
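To make the kind of inference the abstract refers to more concrete, the following is a minimal single-machine sketch in plain Python. It is not Cichlid's implementation and does not use Spark. It assumes two standard RDFS entailment rules, rdfs11 (transitivity of rdfs:subClassOf) and rdfs9 (type inheritance through rdfs:subClassOf). The function names `subclass_closure` and `infer` and the `ex:` example terms are hypothetical. Computing the schema closure first and collecting results in a set loosely mirror the rule-ordering and duplicate-removal concerns the abstract mentions.

```python
from collections import defaultdict

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def subclass_closure(triples):
    """Map each class to all of its transitive superclasses (rule rdfs11)."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS:
            parents[s].add(o)

    closure = {}
    for cls in list(parents):
        seen = set()
        stack = list(parents[cls])
        while stack:  # depth-first walk up the class hierarchy
            sup = stack.pop()
            if sup not in seen:
                seen.add(sup)
                stack.extend(parents.get(sup, ()))
        closure[cls] = seen
    return closure


def infer(triples):
    """Return only the newly derived triples, without duplicates."""
    closure = subclass_closure(triples)
    derived = set()  # a set removes duplicate derivations

    # rdfs11: materialize the transitive subclass relation.
    for cls, supers in closure.items():
        for sup in supers:
            derived.add((cls, SUBCLASS, sup))

    # rdfs9: an instance of a class is an instance of every superclass.
    for s, p, o in triples:
        if p == TYPE:
            for sup in closure.get(o, ()):
                derived.add((s, TYPE, sup))

    return derived - set(triples)


triples = [
    ("ex:GradStudent", SUBCLASS, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:alice", TYPE, "ex:GradStudent"),
]
new_triples = infer(triples)
```

In a distributed setting the small schema closure would typically be shared with every worker while the large set of instance triples is partitioned; that detail is an assumption here, not a description of the paper's design.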


Pages: 700-709

Page count: 10
