Interdependence of Text Mining Quality and the Input Data Preprocessing

Cited by: 3
Authors
Darena, Frantisek [1]
Zizka, Jan [1]
Affiliations
[1] Mendel Univ Brno, Dept Informat, Fac Business & Econ, Zemedelska 1, Brno 61300, Czech Republic
Source
ARTIFICIAL INTELLIGENCE PERSPECTIVES AND APPLICATIONS (CSOC2015) | 2015 / Vol. 347
Keywords
Text mining; text data preprocessing; stemming; spell checking; stop words; support vector machine; decision tree; k-means
DOI
10.1007/978-3-319-18476-0_15
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The paper focuses on the application of preprocessing techniques to short, informal text documents written in different natural languages. The goal is to evaluate their impact on the quality of the results and on the computational complexity of a text mining process designed to reveal knowledge hidden in the data. A large number of experiments were carried out on real-world text data, applying spelling correction, stemming, stop word removal, and their combinations. Support vector machines, decision trees, and k-means, as commonly used methods, were used to analyze the text data. Text mining quality was generally not influenced significantly; however, a positive impact in the form of decreased computational complexity was observed.
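As a rough illustration of why such preprocessing can lower computational cost, the sketch below counts distinct terms (the dimensionality of a bag-of-words representation) before and after stop word removal and stemming. It is not the authors' pipeline: the stop word list, the suffix list, the `naive_stem` helper, and the sample documents are all hypothetical, and a real experiment would use a proper stemmer and a spell checker.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "were", "to", "of", "in"}

# Hypothetical suffixes for a naive stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, use_stop_words=False, use_stemming=False):
    """Return the set of distinct terms after the selected preprocessing."""
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        if use_stop_words:
            tokens = remove_stop_words(tokens)
        if use_stemming:
            tokens = [naive_stem(t) for t in tokens]
        vocab.update(tokens)
    return vocab


# Made-up short documents standing in for informal user-written text.
docs = [
    "The room was cleaned and the staff were helping",
    "Staff helped and cleaned rooms",
    "A clean room is helping",
]

raw = vocabulary(docs)
reduced = vocabulary(docs, use_stop_words=True, use_stemming=True)
print(len(raw), len(reduced))
```

A smaller vocabulary means fewer features for the classifier or clustering algorithm to process, which is the kind of complexity reduction the abstract reports; whether quality changes depends on the data and cannot be inferred from this toy example.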
Pages: 141-150
Number of pages: 10