Indexing text documents based on topic identification

Cited by: 0
Authors
Butarbutar, M [1]
McRoy, S [1]
Affiliations
[1] Univ Wisconsin, Dept Elect Engn & Comp Sci, Milwaukee, WI 53201 USA
Source
STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2004 / Vol. 3246
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work provides algorithms and heuristics to index text documents by determining important topics in the documents. To index text documents, the work provides algorithms to generate topic candidates, determine their importance, detect similar and synonymous topics, and eliminate incoherent topics. The indexing algorithm uses topic frequency to determine the importance and the existence of the topics. Repeated phrases are topic candidates. For example, since the phrase 'index text documents' occurs three times in this abstract, the phrase is one of the topics of this abstract. It is shown that this method is more effective than either a simple word count model or approaches based on term weighting.
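The idea that repeated phrases are topic candidates, ranked by frequency, can be illustrated with a minimal sketch. This is not the authors' algorithm: it only counts repeated word n-grams, and it does not attempt the similar/synonymous topic detection or incoherent-topic elimination the abstract mentions. The function name `repeated_phrases` and the parameters `min_len`, `max_len`, and `min_count` are hypothetical choices made for illustration.

```python
import re
from collections import Counter


def repeated_phrases(text, min_len=2, max_len=4, min_count=2):
    """Return {phrase: count} for word n-grams that repeat in `text`.

    Assumption for illustration: a "phrase" is any run of min_len..max_len
    consecutive words, and a phrase is a topic candidate when it occurs
    at least min_count times.
    """
    # Lowercase and keep alphabetic tokens only. Punctuation is discarded,
    # so n-grams may span sentence boundaries in this simple sketch.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    sample = (
        "To index text documents we index text documents. "
        "Then index text documents again."
    )
    candidates = repeated_phrases(sample)
    # Rank by frequency first, then prefer longer phrases.
    ranked = sorted(
        candidates.items(),
        key=lambda kv: (-kv[1], -len(kv[0].split())),
    )
    for phrase, count in ranked:
        print(count, phrase)
```

A sketch like this overcounts nested phrases (every sub-phrase of a repeated phrase also repeats), which is one reason a full indexing method would need additional steps of the kind the abstract describes.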
Pages: 113-124
Page count: 12