Processing large-scale data with Apache Spark

Cited by: 1

Authors:

Ko, Seyoon [1]

Won, Joong-Ho [1]

Affiliations:

[1] Seoul Natl Univ, Dept Stat, 1 Gwanak Ro, Seoul 08826, South Korea

Source:

KOREAN JOURNAL OF APPLIED STATISTICS | 2016 / Vol. 29 / No. 06

Funding:

National Research Foundation of Singapore;

Keywords:

Spark; machine learning; cluster computing; parallel computing;

DOI:

10.5351/KJAS.2016.29.6.1077

Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification code:

020208 ; 070103 ; 0714 ;

Abstract:

Apache Spark is a fast, general-purpose cluster computing package. It provides a new abstraction, the resilient distributed dataset (RDD), which supports fault tolerance while keeping data in memory. This abstraction yields a significant speedup over the legacy large-scale data framework MapReduce. In particular, Spark is well suited to iterative machine learning applications, such as logistic regression and K-means clustering, and to interactive data querying. Thanks to its versatility, Spark also supports high-level libraries for applications such as machine learning, streaming data processing, database querying, and graph data mining. In this work, we introduce the concept and programming model of Spark and show implementations of some simple statistical computing applications. We also review the machine learning package MLlib and the R language interface SparkR.

Pages: 1077-1094

Page count: 18

50 records in total

[1] Large-Scale Data Pollution with Apache Spark
Hildebrandt, Kai
Panse, Fabian
Wilcke, Niklas
Ritter, Norbert
IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
[2] Large-scale text processing pipeline with Apache Spark
Svyatkovskiy, A.
Imai, K.
Kroeger, M.
Shiraito, Y.
2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 3928 - 3935
[3] Filter Large-scale Engine Data using Apache Spark
Pirozzi, Donato
Scarano, Vittorio
Begg, Steven
De Sercey, Guillaume
Fish, Andrew
Harvey, Andrew
2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2016, : 1300 - 1305
[4] Large-Scale Network Embedding in Apache Spark
Lin, Wenqing
KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 3271 - 3279
[5] A Large-Scale Sentiment Data Classification for Online Reviews Under Apache Spark
Al-Saqqa, Samar
Al-Naymat, Ghazi
Awajan, Arafat
9TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN-2018) / 8TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2018), 2018, 141 : 183 - 189
[6] Supervised Papers Classification on Large-Scale High-Dimensional Data with Apache Spark
Akritidis, Leonidas
Bozanis, Panayiotis
Fevgas, Athanasios
2018 16TH IEEE INT CONF ON DEPENDABLE, AUTONOM AND SECURE COMP, 16TH IEEE INT CONF ON PERVAS INTELLIGENCE AND COMP, 4TH IEEE INT CONF ON BIG DATA INTELLIGENCE AND COMP, 3RD IEEE CYBER SCI AND TECHNOL CONGRESS (DASC/PICOM/DATACOM/CYBERSCITECH), 2018, : 987 - 994
[7] GeoMatch: Efficient Large-Scale Map Matching on Apache Spark
Zeidan, Ayman
Lagerspetz, Eemil
Zhao, Kai
Nurmi, Petteri
Tarkoma, Sasu
Vo, Huy T.
2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 384 - 391
[8] Particle Swarm Optimization for Large-Scale Clustering on Apache Spark
Sherar, Matthew
Zulkernine, Farhana
2017 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2017, : 801 - 808
[9] GeoMatch: Efficient Large-scale Map Matching on Apache Spark
Zeidan, Ayman
Lagerspetz, Eemil
Zhao, Kai
Nurmi, Petteri
Tarkoma, Sasu
Vo, Huy T.
ACM/IMS Transactions on Data Science, 2020, 1 (03):
[10] Optimizing Apache Spark MLlib: Predictive Performance of Large-Scale Models for Big Data Analytics
Theodorakopoulos, Leonidas
Karras, Aristeidis
Krimpas, George A.
ALGORITHMS, 2025, 18 (02)
