Distributed Classification of Text Documents on Apache Spark Platform

Cited by: 12
Authors
Semberecki, Piotr [1]
Maciejewski, Henryk [1]
Affiliations
[1] Wroclaw Univ Technol, Dept Comp Engn, Wybrzeze Wyspianskiego 27, PL-50270 Wroclaw, Poland
Keywords
Text subject classification; Natural Language Processing (NLP); Machine learning; Apache Spark
DOI
10.1007/978-3-319-39378-0_53
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents an implementation of a system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with the generation of high-dimensional feature vectors from documents, a task realized with methods and tools for natural language processing. The next steps involve reducing the dimensionality of the feature vectors and training classifiers. In the paper we show how these consecutive steps can be realized on the Apache Spark platform, which is dedicated to distributed processing of big data. We illustrate the proposed method with a sample classifier that predicts the subject category of a document in the English-language Wikipedia.
Pages: 621-630
Page count: 10
Related papers
  • [1] Fast Text Classification with Naive Bayes Method on Apache Spark
    Ogul, Iskender Ulgen
    Ozcan, Caner
    Hakdagli, Ozlem
    2017 25TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2017
  • [2] Distributed boosting algorithm for classification of text documents
    Sarnovsky, Martin
    Vronc, Michal
    2014 IEEE 12TH INTERNATIONAL SYMPOSIUM ON APPLIED MACHINE INTELLIGENCE AND INFORMATICS (SAMI), 2014: 216 - 219
  • [3] Parallel and Distributed Implementation of Sine Cosine Algorithm on Apache Spark Platform
    Alfailakawi, Mohammad Gh.
    Aljame, Maryam
    Ahmad, Imtiaz
    IEEE ACCESS, 2021, 9 : 77188 - 77202
  • [4] Performance Prediction for Apache Spark Platform
    Wang, Kewen
    Khan, Mohammad Maifi Hasan
    2015 IEEE 17TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, 2015 IEEE 7TH INTERNATIONAL SYMPOSIUM ON CYBERSPACE SAFETY AND SECURITY, AND 2015 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (ICESS), 2015: 166 - 173
  • [5] An Apache Spark Implementation for Text Document Clustering
    Dritsas, Elias
    Trigka, Maria
    Vonitsanos, Gerasimos
    Kanavos, Andreas
    Mylonas, Phivos
    2022 17TH INTERNATIONAL WORKSHOP ON SEMANTIC AND SOCIAL MEDIA ADAPTATION & PERSONALIZATION (SMAP 2022), 2022: 50 - 55
  • [6] Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark
    Alexopoulos, Athanasios
    Drakopoulos, Georgios
    Kanavos, Andreas
    Mylonas, Phivos
    Vonitsanos, Gerasimos
    ALGORITHMS, 2020, 13 (03)
  • [8] Classification of text documents
    Li, YH
    Jain, AK
    FOURTEENTH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1 AND 2, 1998: 1295 - 1297
  • [9] Classification of text documents
    Li, YH
    Jain, AK
    COMPUTER JOURNAL, 1998, 41 (08): 537 - 546
  • [10] Efficient distributed SPARQL queries on Apache Spark
    Albahli S.
    International Journal of Advanced Computer Science and Applications, 2019, 10 (08): 564 - 568