Efficient Distributed Range Query Processing in Apache Spark

被引：4

作者：

Papadopoulos, Apostolos N. ^{[1
]}

Sioutas, Spyros ^{[2
]}

Zacharatos, Nikolaos ^{[2
]}

Zaroliagis, Christos ^{[2
,3
]}

机构：

[1] Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki, Greece

[2] Univ Patras, Dept Comp Engn & Informat, Patras, Greece

[3] Univ Patras, CTI, Patras, Greece

来源：

2019 19TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID) | 2019年

关键词：

Range queries; Indexing; Apache Spark; Performance evaluation;

D O I：

10.1109/CCGRID.2019.00073

中图分类号：

TP3 [计算技术、计算机技术];

学科分类号：

0812 ;

摘要：

Range queries are important in many diverse applications. In its simplest one-dimensional form, a range query is expressed by an interval [a, b] on the real line, whereas the answer consists of all elements e is an element of[a, b]. In this work, we focus on efficient range query processing techniques in the Apache Spark engine, which is the state-of-the-art solution for big data management and analytics. We aim at developing a Spark-based indexing scheme that supports range queries in such large-scale decentralized environments and scale well w.r.t. the number of nodes and the data items stored. Towards this goal, there have been solutions in the last few years, which however turn out to be inadequate at the envisaged scale, since the classic linear or even the logarithmic complexity (for point queries) is still too expensive, whereas range query processing is even more demanding. In this paper, we go one step further and present a solution with sub-logarithmic complexity. In particular, we present SPIS (Spark-based Interpolation Search), a tree structure that outperforms the existing Spark built-in lookup techniques. We carry out an experimental evaluation by using synthetic data sets. Our experimental results demonstrate the efficiency and scalability of the proposed approach.

引用

页码：569 / 575

页数：7

共 15 条

[1]

Abdallah Maha, 2005, P IASTED PDCS

[2]

[Anonymous], 2015, Hadoop-The Definitive Guide: Storage and Analysis at Internet Scale

[3]

[Anonymous], 2006, P 2006 ACM SIGMOD IN

[4]

Aspnes J, 2003, SIAM PROC S, P384

[5]

Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137

[6] The Rainbow Skip Graph: A Fault-Tolerant Constant-Degree Distributed Data Structure [J].

Goodrich, Michael T. ;

Nelson, Michael J. ;

Sun, Jonathan Z. .

PROCEEDINGS OF THE SEVENTHEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2006, :384-393

[7]

Jagadish H.V., 2005, Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), P661

[8] DYNAMIC INTERPOLATION SEARCH [J].

MEHLHORN, K ;

TSAKALIDIS, A .

JOURNAL OF THE ACM, 1993, 40 (03) :621-634

[9]

Pearl Y., 1978, COMMUN ACM, V21

[10] ADDRESSING FOR RANDOM-ACCESS STORAGE [J].

PETERSON, WW .

IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1957, 1 (02) :130-146

← 1 2 →