Gauging Similarity with n-Grams: Language-Independent Categorization of Text

Cited by: 314
Author
DAMASHEK, M
Affiliation
[1] Department of Defense, Fort George G. Meade
DOI
10.1126/science.267.5199.843
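Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes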
07 ; 0710 ; 09 ;
Abstract
A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
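The combination of character n-grams with a vector-space comparison described in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the author's implementation: the abstract does not state the value of n, how text is normalized, or how the vectors are weighted, so the default `n=5`, the whitespace and case handling, the relative-frequency weighting, and the function names `ngram_profile` and `cosine_similarity` are all assumptions made for illustration. The context-handling procedure mentioned in the abstract is not modeled here.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Map each overlapping character n-gram to its relative frequency.

    Assumption: case is folded and runs of whitespace are collapsed;
    the abstract does not specify any preprocessing.
    """
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:  # text shorter than n characters
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram vectors (dicts)."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    exemplar = ngram_profile("the cat sat on the mat", n=3)
    near = ngram_profile("the cat sat on a mat", n=3)
    far = ngram_profile("quantum chromodynamics lattice", n=3)
    # The near document should score higher against the exemplar.
    print(cosine_similarity(exemplar, near), cosine_similarity(exemplar, far))
```

Because the profile is built from raw characters rather than words, nothing in the sketch depends on a tokenizer, stop list, or dictionary, which is what allows the same code to run unchanged on text in any language.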
Pages: 843-848
Page count: 6