Latent semantic analysis for text categorization using neural network

Cited by: 74
Authors
Yu, Bo [1 ]
Xu, Zong-ben [2 ]
Li, Cheng-hua [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China
[2] Xi An Jiao Tong Univ, Sch Sci, Inst Informat & Syst Sci, Xian 710049, Peoples R China
[3] Chonbuk Natl Univ, Dept Informat & Commun Engn, Chonju 561756, South Korea
Funding
National Natural Science Foundation of China;
Keywords
Latent semantic analysis; Neural network; Text categorization;
DOI
10.1016/j.knosys.2008.03.045
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
New text categorization models using a back-propagation neural network (BPNN) and a modified back-propagation neural network (MBPNN) are proposed. An efficient feature selection method is used to reduce the dimensionality as well as improve the performance. The basic BPNN learning algorithm has the drawback of slow training speed, so we modify it to accelerate training; categorization accuracy is consequently improved as well. Traditional word-matching-based text categorization systems use the vector space model (VSM) to represent documents. However, the VSM needs a high-dimensional space to represent a document and does not take into account the semantic relationships between terms, which can also lead to poor classification accuracy. Latent semantic analysis (LSA) can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. It not only greatly reduces the dimensionality but also discovers important associative relationships between terms. We test our categorization models on the 20-newsgroup data set; experimental results show that the models using MBPNN outperform the basic BPNN, and that applying LSA to our system leads to dramatic dimensionality reduction while achieving good classification results. (C) 2008 Elsevier B.V. All rights reserved.
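The LSA representation described in the abstract (replacing the high-dimensional term-document matrix with a low-rank approximation obtained via singular value decomposition) can be sketched in a few lines. This is a minimal illustration only: the toy term-document matrix and its term/document labels are invented for the example and are not from the paper.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, cols = documents).
# The vocabulary and documents are hypothetical, chosen so that
# documents 0 and 2 share "neural network" terms while 1 and 3 share
# "stock market" terms.
A = np.array([
    [2, 0, 1, 0],   # "network"
    [1, 0, 2, 0],   # "neural"
    [0, 3, 0, 1],   # "market"
    [0, 1, 0, 2],   # "stock"
    [1, 1, 1, 1],   # "analysis"
], dtype=float)

# LSA: truncated SVD, A ~ U_k diag(s_k) Vt_k with k much smaller than
# the vocabulary size.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Each document becomes a k-dimensional vector in the latent space.
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # shape: (n_docs, k)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(doc_vecs.shape)  # → (4, 2)
# Documents 0 and 2 share vocabulary, so their latent vectors are far
# more similar than those of documents 0 and 1.
print(cosine(doc_vecs[0], doc_vecs[2]) > cosine(doc_vecs[0], doc_vecs[1]))  # → True
```

A classifier such as the BPNN/MBPNN of the paper would then be trained on these k-dimensional document vectors instead of the raw term counts.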
Pages: 900-904
Page count: 5
References
21 items
[1]   Using linear algebra for intelligent information retrieval [J].
Berry, MW ;
Dumais, ST ;
OBrien, GW .
SIAM REVIEW, 1995, 37 (04) :573-595
[2]  
CALVO AR, 2004, J DIGITAL INFORM, V5
[3]   Fast and accurate text classification via multiple linear discriminant projections [J].
Chakrabarti, S ;
Roy, S ;
Soundalgekar, MV .
VLDB JOURNAL, 2003, 12 (02) :170-185
[4]
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[6]  
GABRILOVICH E, 2004, ICML 04, P321
[7]  
Joachims T., 1998, Lecture Notes in Computer Science, P137, DOI DOI 10.1007/BFB0026683
[8]   Maximum entropy models with inequality constraints: A case study on text categorization [J].
Kazama, J ;
Tsujii, J .
MACHINE LEARNING, 2005, 60 (1-3) :159-194
[9]   Feature reduction for neural network based text categorization [J].
Lam, SLY ;
Lee, DL .
6TH INTERNATIONAL CONFERENCE ON DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 1999, :195-+
[10]
Li Ronglu, 2005, Journal of Computer Research and Development, V42, P94, DOI 10.1360/crad20050113