Towards enriching the quality of k-nearest neighbor rule for document classification

Cited by: 15
Authors
Basu, Tanmay [1]
Murthy, C. A. [1]
Affiliation
[1] Indian Stat Inst, Machine Intelligence Unit, Kolkata, India
Keywords
k-nearest neighbor; Text classification
DOI
10.1007/s13042-013-0177-1
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The k-nearest neighbor (kNN) rule is a simple and effective classifier for document classification. In this method, a test document is assigned to a class if that class has the maximum representation among the document's k nearest neighbors in the training set. The k nearest neighbors of a test document are ranked by their content similarity to it. Document classification is very challenging because of the large number of attributes in the data set. Because the data are sparse, many attributes provide no information about a particular document. Thus, assigning a document to a predefined class for a large value of k may not be accurate when the margin of the majority vote is one or when a tie occurs. This article modifies the kNN rule by placing a threshold on the majority vote, and it proposes a discrimination criterion to prune the actual search space of the test document. The proposed classification rule enhances the confidence of the voting process and makes no prior assumption about the number of nearest neighbors. Experimental evaluation on various well-known text data sets shows that the accuracy of the proposed method is significantly better than that of the traditional kNN method as well as some other document classification methods.
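The idea of thresholding the majority vote can be illustrated with a minimal sketch. The abstract does not give the paper's actual threshold or its discrimination criterion, so everything here is an assumption made for illustration: cosine similarity over sparse term-weight dictionaries, a hypothetical `min_margin` parameter, and a fixed `k` (the paper's rule, by contrast, makes no prior assumption about the number of neighbors). The function name `knn_with_vote_margin` is likewise hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_with_vote_margin(test_doc, training, k, min_margin):
    """Majority vote among the k most similar training documents.

    Returns the winning label, or None when the winner's lead over the
    runner-up is below min_margin (an assumed stand-in for the paper's
    threshold, which the abstract does not specify).
    """
    # Rank training documents by similarity to the test document.
    ranked = sorted(training, key=lambda item: cosine(test_doc, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    top = votes.most_common(2)
    if not top:
        return None  # no training documents to vote
    winner, best = top[0]
    runner_up = top[1][1] if len(top) > 1 else 0
    return winner if best - runner_up >= min_margin else None


# Toy training set: (term weights, label) pairs, invented for this example.
training = [
    ({"ball": 1, "goal": 1}, "sport"),
    ({"ball": 1, "team": 1}, "sport"),
    ({"vote": 1, "party": 1}, "politics"),
    ({"vote": 1, "election": 1}, "politics"),
]

clear_doc = {"ball": 1, "goal": 1, "team": 1}
tied_doc = {"ball": 1, "vote": 1}

print(knn_with_vote_margin(clear_doc, training, k=3, min_margin=1))
print(knn_with_vote_margin(clear_doc, training, k=3, min_margin=2))
print(knn_with_vote_margin(tied_doc, training, k=4, min_margin=1))
```

With `k=3`, the first test document receives two "sport" votes and one "politics" vote, so it is labeled only when a margin of one is accepted; the second document produces a 2-2 tie and is left undecided. Returning `None` in these cases is one possible design choice; what the proposed method does with low-confidence votes is not stated in the abstract.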
Pages: 897-905
Number of pages: 9