Finding keywords amongst noise: Automatic text classification without parsing

Cited by: 5
Authors
Allison, Andrew G. [1 ]
Pearce, Charles E. M. [2 ]
Abbott, Derek [1 ]
Affiliations
[1] Univ Adelaide, Ctr Biomed Engn, Adelaide, SA 5005, Australia
[2] Univ Adelaide, Sch Mathemat Sci, Adelaide, SA 5005, Australia
Source
NOISE AND STOCHASTICS IN COMPLEX SYSTEMS AND FINANCE | 2007 / Vol. 6601
Keywords
keywords; word recurrence interval; finite mixture distributions; mixed Poisson process; maximum likelihood; Kolmogorov-Smirnov;
DOI
10.1117/12.724655
Chinese Library Classification number
F8 [Public finance, finance];
Discipline classification code
0202;
Abstract
The amount of text stored on the Internet, and in our libraries, continues to expand at an exponential rate. There is a great practical need to locate relevant content. This requires quick automated methods for classifying textual information according to subject. We propose a quick statistical approach, which can distinguish between 'keywords' and 'noisewords', like 'the' and 'a', without the need to parse the text into its parts of speech. Our classification is based on an F-statistic, which compares the observed Word Recurrence Interval (WRI) with a simple null hypothesis. We also propose a model to account for the observed distribution of WRI statistics and we subject this model to a number of tests.
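The idea of a word recurrence interval can be illustrated with a minimal Python sketch. This is not the authors' F-statistic, whose exact form the abstract does not give. As a stand-in, it assumes a null hypothesis in which a word occurs independently at each position, so that gaps between occurrences are geometrically distributed, and it compares the observed gap variance with the variance that null predicts. The function names `recurrence_intervals` and `clustering_ratio` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def recurrence_intervals(tokens):
    """Map each word to the gaps between its successive occurrences."""
    last_seen = {}
    intervals = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            intervals[word].append(position - last_seen[word])
        last_seen[word] = position
    return intervals


def clustering_ratio(gaps):
    """Observed gap variance divided by the variance under a geometric null.

    Assumed null (not taken from the paper): the word appears independently
    at each position with probability p = 1 / mean(gaps), so the gaps are
    geometric with variance (1 - p) / p**2 = m * (m - 1), where m = mean(gaps).
    Returns None when the null variance is zero (word repeats at every step).
    """
    m = mean(gaps)
    if m <= 1:
        return None
    return pvariance(gaps) / (m * (m - 1))
```

Under this assumed null, a ratio well above 1 indicates that the word arrives in bursts, which is the behaviour one would expect of a topical keyword, while a ratio at or below 1 indicates evenly or randomly spaced occurrences, as might be expected of words like 'the' and 'a'. No parsing into parts of speech is involved, only token positions.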
Pages: 12