Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering

Cited by: 22

Authors:

Kadhim, Ammar Ismael [1,2]

Cheah, Yu-N [1]

Ahamed, Nurul Hashimah [1]

Affiliations:

[1] Univ Sains Malaysia, Sch Comp Sci, George Town, Malaysia

[2] Coll Med, Dept Comp Sci, Baghdad, Iraq

Source:

PROCEEDINGS 2014 4TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE WITH APPLICATIONS IN ENGINEERING AND TECHNOLOGY ICAIET 2014 | 2014

Keywords:

Text mining; retrieval information; clustering; singular value decomposition; dimension reduction; k-means;

DOI:

10.1109/ICAIET.2014.21

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Text mining is generally defined as the process of extracting interesting (non-trivial) features and knowledge from unstructured text documents. It is an interdisciplinary field that draws on information retrieval, data mining, machine learning, parametric statistics and computational linguistics. Standard text mining and information retrieval techniques for text documents usually rely on similar categories. An alternative method of retrieving information is to cluster documents after preprocessing the text. The preprocessing steps strongly affect how successfully knowledge can be extracted. This study implements TF-IDF weighting and singular value decomposition (SVD) for dimensionality reduction. The proposed system presents effective preprocessing and dimensionality reduction techniques that support document clustering with the k-means algorithm. Finally, the experimental results show that the proposed method enhances the performance of English text document clustering. Simulation results on the BBC news and BBC sport datasets show the superiority of the proposed algorithm.
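
The pipeline named in the abstract (TF-IDF weighting, SVD-based dimension reduction, then k-means) can be illustrated with a minimal standard-library Python sketch. This is not the authors' implementation: the function names (`tfidf_rows`, `svd_coords`, `kmeans`, `cluster_documents`) are hypothetical, the whitespace tokeniser stands in for preprocessing steps the abstract does not detail, the smoothed IDF formula is an assumption, and the truncated SVD is approximated by power iteration on the Gram matrix rather than a full decomposition.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one L2-normalised TF-IDF vector per document."""
    tokens = [doc.lower().split() for doc in docs]  # naive whitespace tokeniser
    vocab = sorted({t for toks in tokens for t in toks})
    df = Counter(t for toks in tokens for t in set(toks))
    n = len(docs)
    # Smoothed IDF: an assumption, the abstract does not give the formula.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokens:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return rows


def svd_coords(rows, k, iters=200):
    """Document coordinates on the top-k singular directions (U_k * Sigma_k).

    Power iteration with deflation on the Gram matrix X X^T, whose
    eigenvectors are the left singular vectors of X.
    """
    n = len(rows)
    gram = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]
    coords = [[] for _ in range(n)]
    for comp in range(k):
        v = [1.0 / (i + 1 + comp) for i in range(n)]  # deterministic start
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(g * x for g, x in zip(gram[r], v)) for r in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        gv = [sum(g * x for g, x in zip(gram[r], v)) for r in range(n)]
        lam = max(sum(a * b for a, b in zip(v, gv)), 0.0)  # Rayleigh quotient
        for i in range(n):
            coords[i].append(math.sqrt(lam) * v[i])
            for j in range(n):
                gram[i][j] -= lam * v[i] * v[j]  # deflate the found component
    return coords


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: dist2(p, centroids[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


def cluster_documents(docs, n_clusters, n_dims):
    """TF-IDF -> truncated SVD -> k-means, returning one label per document."""
    return kmeans(svd_coords(tfidf_rows(docs), n_dims), n_clusters)


docs = [
    "football match goal team",
    "team wins football cup",
    "shares market profit bank",
    "bank profit market falls",
]
print(cluster_documents(docs, n_clusters=2, n_dims=2))
```

On this toy corpus the two sport sentences and the two business sentences share no vocabulary, so they should fall into separate clusters. For real corpora such as the BBC datasets mentioned above, a numerical library would be a more practical choice than this pure-Python approximation.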


Pages: 69-73

Number of pages: 5
