An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts

Cited by: 70
Authors
Wilbur, WJ [1]
Yang, YM [1]
Affiliations
[1] MAYO CLIN & MAYO FDN,SECT MED INFORMAT RESOURCES,ROCHESTER,MN 55905
Keywords
molecular biology; stop terms; text retrieval; text classification; linear least squares fit; weight; strength; vector model; Bayesian model;
DOI
10.1016/0010-4825(95)00055-0
Chinese Library Classification
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
The biological literature presents a difficult challenge to information processing in its complexity, diversity, and in its sheer volume. Much of the diversity resides in its technical terminology, which has also become voluminous. In an effort to deal more effectively with this large vocabulary and improve information processing, a method of focus has been developed which allows one to classify terms based on a measure of their importance in describing the content of the documents in which they occur. The measurement is called the strength of a term and is a measure of how strongly the term's occurrences correlate with the subjects of documents in the database. If term occurrences are random then there will be no correlation and the strength will be zero, but if for any subject, the term is either always present or never present its strength will be one. We give here a new, information theoretical interpretation of term strength, review some of its uses in focusing the processing of documents for information retrieval and describe new results obtained in document categorization. Copyright (C) 1996 Elsevier Science Ltd.
Pages: 209-222
Page count: 14
Related papers
31 in total
[1]  
BARKLA JK, 1969, UNPUB CONSTRUCTION W
[2]  
Buckley Chris, 1985, 85686 CORN U DEP COM
[3]  
COOPER WS, 1991, P 54 ANN M AM SOC IN, V28, P366
[4]  
CROFT WB, 1982, 8221 COINS U MASS
[5]  
VAN RIJSBERGEN CJ, 1979, INFORMATION RETRIEVA
[6]   THE VOCABULARY PROBLEM IN HUMAN SYSTEM COMMUNICATION [J].
FURNAS, GW ;
LANDAUER, TK ;
GOMEZ, LM ;
DUMAIS, ST .
COMMUNICATIONS OF THE ACM, 1987, 30 (11) :964-971
[7]  
HARMAN D, 1994, 2 TEXT RETR C TREC 2, P1
[8]   USE OF MEDLINE BY PHYSICIANS FOR CLINICAL PROBLEM-SOLVING [J].
LINDBERG, DAB ;
SIEGEL, ER ;
RAPP, BA ;
WALLINGFORD, KT ;
WILSON, SR .
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, 1993, 269 (24) :3124-3129
[9]  
LUCARELLA D, 1988, J INFORM SCI, V14, P25, DOI 10.1177/016555158801400104
[10]   PROBABILISTIC SEARCH STRATEGY FOR MEDLARS [J].
MILLER, WL .
JOURNAL OF DOCUMENTATION, 1971, 27 (04) :254-&