Lexical Text Segmentation Using Dictionaries

被引:0
作者
Chawathe, Sudarshan S. [1 ]
机构
[1] Univ Maine, Sch Comp & Informat Sci, Orono, ME 04469 USA
来源
2018 IEEE 8TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC) | 2018年
基金
美国国家科学基金会;
关键词
word segmentation; text processing; dynamic programming; string matching;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Text segmentation refers to the task of partitioning text into disjoint segments based on some matching and optimization criteria. Examples include partitioning text into words, graphemes, and phonemes. The problem is especially challenging when the language does not require spaces, punctuation, or other simple separators; when segments may be combined in nontrivial ways; and in the presence of errors in transcription or recognition. This paper focuses on a purely lexical method of segmentation: Text is segmented using only a dictionary of known words along with a compatible cost function. No grammatical or other higher-level knowledge is used. The method uses efficient algorithms for multiple-string matching, such as the classic Aho-Corasick algorithm, to yield significant improvements in running time when compared with a simpler dynamic programming algorithm. An experimental study compares the running times of the dictionary-based and dynamic programming algorithms.
引用
收藏
页码:47 / 53
页数:7
相关论文
共 17 条
[1]   EFFICIENT STRING MATCHING - AID TO BIBLIOGRAPHIC SEARCH [J].
AHO, AV ;
CORASICK, MJ .
COMMUNICATIONS OF THE ACM, 1975, 18 (06) :333-340
[2]  
[Anonymous], 2017, THE FOOLISH TORTOISE
[3]  
Bor R., 2017, JAVA IMPLEMENTATION
[4]  
Bothner Per., 2017, KAWA SCHEME LANGUAGE
[5]  
Coleridge S. T., 2006, PROJECT GUTENBERG
[6]  
Commentz-Walter B., 1979, Automata, Languages and Programming, P118
[7]  
Cormen T. H., 2009, Introduction to Algorithms, V3rd
[8]  
Huet G., 2017, SANSKRIT READER COMP
[9]  
Janert P. K., 2010, GNUPLOT IN ACTION
[10]  
Jiang W., 2015, WORDSEGMENTATION