Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering

Cited by: 22
Authors
Kadhim, Ammar Ismael [1 ,2 ]
Cheah, Yu-N [1 ]
Ahamed, Nurul Hashimah [1 ]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, George Town, Malaysia
[2] Coll Med, Dept Comp Sci, Baghdad, Iraq
Keywords
Text mining; retrieval information; clustering; singular value decomposition; dimension reduction; k-means;
DOI
10.1109/ICAIET.2014.21
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text mining generally refers to the process of extracting interesting (non-trivial) features and knowledge from unstructured text documents. It is an interdisciplinary field that draws on information retrieval, data mining, machine learning, parameter statistics and computational linguistics. Standard text mining and information retrieval techniques for text documents usually rely on similar categories. An alternative method of retrieving information is clustering documents to preprocess text. The preprocessing steps have a large effect on the success of knowledge extraction. This study implements TF-IDF and singular value decomposition (SVD) dimensionality reduction techniques. The proposed system presents effective preprocessing and dimensionality reduction techniques that support document clustering with the k-means algorithm. Finally, the experimental results show that the proposed method enhances the performance of English text document clustering. Simulation results on the BBC news and BBC sport datasets show the superiority of the proposed algorithm.
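The pipeline named in the abstract (TF-IDF weighting, SVD dimensionality reduction, then k-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the function names (`tokenize`, `tfidf_matrix`, `svd_reduce`, `kmeans`), the toy documents, and the specific TF-IDF variant (relative term frequency times log(N/df)) are all assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import math
import re
from collections import Counter

import numpy as np

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "on", "for", "after"}


def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Build a documents-by-terms TF-IDF matrix (tf = count/len, idf = log(N/df))."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    index = {w: j for j, w in enumerate(vocab)}
    df = Counter(w for t in tokens for w in set(t))  # document frequency per term
    n = len(docs)
    X = np.zeros((n, len(vocab)))
    for i, t in enumerate(tokens):
        for w, c in Counter(t).items():
            X[i, index[w]] = (c / len(t)) * math.log(n / df[w])
    return X, vocab


def svd_reduce(X, k):
    """Project documents onto the top-k singular directions (truncated SVD)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means with random initial centers drawn from the data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical toy corpus, only to show how the steps chain together.
docs = [
    "The striker scored a goal in the football match",
    "The football team won the match after a late goal",
    "The bank raised interest rates on loans",
    "Interest rates and loans worry the bank",
]
X, vocab = tfidf_matrix(docs)
Z = svd_reduce(X, 2)
labels = kmeans(Z, 2)
print(labels)
```

The design keeps each stage as a separate function so the reduced dimension `k` in `svd_reduce` and the cluster count in `kmeans` can be varied independently; results on a corpus this small depend on the random initialization and should not be read as evidence about the paper's findings.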
Pages: 69-73
Page count: 5
Related papers
50 in total
  • [1] Document representation and dimension reduction for text clustering
    Shafiei, Mahdi
    Wang, Singer
    Zhang, Roger
    Milios, Evangelos
    Tang, Bin
    Tougas, Jane
    Spiteri, Ray
    2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, VOLS 1-2, 2007, : 770 - 779
  • [2] Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering
    Abualigah, Laith Mohammad
    Khader, Ahamad Tajudin
    Al-Betar, Mohammed Azmi
    Alomari, Osama Ahmad
    EXPERT SYSTEMS WITH APPLICATIONS, 2017, 84 : 24 - 36
  • [3] Comparing dimension reduction techniques for document clustering
    Tang, B
    Shepherd, M
    Heywood, MI
    Luo, X
    ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2005, 3501 : 292 - 296
  • [4] Text document clustering and the space of concept on text document automatically generated
    Fu, WP
    Wu, B
    He, Q
    Shi, ZZ
    2001 INTERNATIONAL CONFERENCES ON INFO-TECH AND INFO-NET PROCEEDINGS, CONFERENCE A-G: INFO-TECH & INFO-NET: A KEY TO BETTER LIFE, 2001, : C107 - C112
  • [5] Text document clustering based on neighbors
    Luo, Congnan
    Li, Yanjun
    Chung, Soon M.
    DATA & KNOWLEDGE ENGINEERING, 2009, 68 (11) : 1271 - 1288
  • [6] Ontologies improve text document clustering
    Hotho, A
    Staab, S
    Stumme, G
    THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2003, : 541 - 544
  • [7] Text Document Clustering with Metric Learning
    Wang, Jinlong
    Wu, Shunyao
    Huy Quan Vu
    Li, Gang
    SIGIR 2010: PROCEEDINGS OF THE 33RD ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH DEVELOPMENT IN INFORMATION RETRIEVAL, 2010, : 783 - 784
  • [8] Text Document Clustering: The Application of Cluster Analysis to Textual Document
    2016, Institute of Electrical and Electronics Engineers Inc., United States
  • [9] Text Document Clustering: The Application of Cluster Analysis to Textual Document
    Reddy, Venkata Srikanth
    Kinnicutt, Patrick
    Lee, Roger
    2016 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE & COMPUTATIONAL INTELLIGENCE (CSCI), 2016, : 1174 - 1179
  • [10] An Apache Spark Implementation for Text Document Clustering
    Dritsas, Elias
    Trigka, Maria
    Vonitsanos, Gerasimos
    Kanavos, Andreas
    Mylonas, Phivos
    2022 17TH INTERNATIONAL WORKSHOP ON SEMANTIC AND SOCIAL MEDIA ADAPTATION & PERSONALIZATION (SMAP 2022), 2022, : 50 - 55