Processing large-scale data with Apache Spark

被引:1
|
作者
Ko, Seyoon [1 ]
Won, Joong-Ho [1 ]
机构
[1] Seoul Natl Univ, Dept Stat, 1 Gwanak Ro, Seoul 08826, South Korea
基金
新加坡国家研究基金会;
关键词
Spark; machine learning; cluster computing; parallel computing;
D O I
10.5351/KJAS.2016.29.6.1077
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
Apache Spark is a fast and general-purpose cluster computing package. It provides a new abstraction named resilient distributed dataset, which is capable of support for fault tolerance while keeping data in memory. This type of abstraction results in a significant speedup compared to legacy large-scale data framework, MapReduce. In particular, Spark framework is suitable for iterative machine learning applications such as logistic regression and K-means clustering, and interactive data querying. Spark also supports high level libraries for various applications such as machine learning, streaming data processing, database querying and graph data mining thanks to its versatility. In this work, we introduce the concept and programming model of Spark as well as show some implementations of simple statistical computing applications. We also review the machine learning package MLlib, and the R language interface SparkR.
引用
收藏
页码:1077 / 1094
页数:18
相关论文
共 50 条
  • [1] Large-Scale Data Pollution with Apache Spark
    Hildebrandt, Kai
    Panse, Fabian
    Wilcke, Niklas
    Ritter, Norbert
    IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
  • [2] Large-scale text processing pipeline with Apache Spark
    Svyatkovskiy, A.
    Imai, K.
    Kroeger, M.
    Shiraito, Y.
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 3928 - 3935
  • [3] Filter Large-scale Engine Data using Apache Spark
    Pirozzi, Donato
    Scarano, Vittorio
    Begg, Steven
    De Sercey, Guillaume
    Fish, Andrew
    Harvey, Andrew
    2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2016, : 1300 - 1305
  • [4] Large-Scale Network Embedding in Apache Spark
    Lin, Wenqing
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 3271 - 3279
  • [5] A Large-Scale Sentiment Data Classification for Online Reviews Under Apache Spark
    Al-Saqqa, Samar
    Al-Naymat, Ghazi
    Awajan, Arafat
    9TH INTERNATIONAL CONFERENCE ON EMERGING UBIQUITOUS SYSTEMS AND PERVASIVE NETWORKS (EUSPN-2018) / 8TH INTERNATIONAL CONFERENCE ON CURRENT AND FUTURE TRENDS OF INFORMATION AND COMMUNICATION TECHNOLOGIES IN HEALTHCARE (ICTH-2018), 2018, 141 : 183 - 189
  • [6] Supervised Papers Classification on Large-Scale High-Dimensional Data with Apache Spark
    Akritidis, Leonidas
    Bozanis, Panayiotis
    Fevgas, Athanasios
    2018 16TH IEEE INT CONF ON DEPENDABLE, AUTONOM AND SECURE COMP, 16TH IEEE INT CONF ON PERVAS INTELLIGENCE AND COMP, 4TH IEEE INT CONF ON BIG DATA INTELLIGENCE AND COMP, 3RD IEEE CYBER SCI AND TECHNOL CONGRESS (DASC/PICOM/DATACOM/CYBERSCITECH), 2018, : 987 - 994
  • [7] GeoMatch: Efficient Large-Scale Map Matching on Apache Spark
    Zeidan, Ayman
    Lagerspetz, Eemil
    Zhao, Kai
    Nurmi, Petteri
    Tarkoma, Sasu
    Vo, Huy T.
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 384 - 391
  • [8] Particle Swarm Optimization for Large-Scale Clustering on Apache Spark
    Sherar, Matthew
    Zulkernine, Farhana
    2017 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2017, : 801 - 808
  • [9] GeoMatch: Efficient Large-scale Map Matching on Apache Spark
    Zeidan, Ayman
    Lagerspetz, Eemil
    Zhao, Kai
    Nurmi, Petteri
    Tarkoma, Sasu
    Vo, Huy T.
    ACM/IMS Transactions on Data Science, 2020, 1 (03):
  • [10] Optimizing Apache Spark MLlib: Predictive Performance of Large-Scale Models for Big Data Analytics
    Theodorakopoulos, Leonidas
    Karras, Aristeidis
    Krimpas, George A.
    ALGORITHMS, 2025, 18 (02)