The stringdist Package for Approximate String Matching

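Cited by: 2
Authors
van der Loo, Mark P. J.
Affiliation
Source
R Journal | 2014 / Vol. 6 / No. 1
Keywords
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract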
Comparing text strings in terms of distance functions is a common and fundamental task in many statistical text-processing applications. Thus far, string distance functionality has been somewhat scattered around R and its extension packages, leaving users with inconsistent interfaces and encoding handling. The stringdist package was designed to offer a low-level interface to several popular string distance algorithms which have been re-implemented in C for this purpose. The package offers distances based on counting q-grams, edit-based distances, and some lesser-known heuristic distance functions. Based on this functionality, the package also offers inexact matching equivalents of R's native exact matching functions match and %in%.
Pages: 111-122
Page count: 12
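
The ideas named in the abstract (a q-gram based distance, an edit-based distance, and inexact counterparts of exact matching) can be illustrated with a minimal sketch. It is written in Python rather than R or C, and it is not the package's implementation or API: the function names `qgram_distance`, `levenshtein`, `approx_match`, and `approx_in`, the `max_dist` parameter, and the choice of Levenshtein distance as the default are all assumptions made for illustration.

```python
from collections import Counter


def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram counts of a and b."""
    count_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    count_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(count_a[g] - count_b[g])
               for g in count_a.keys() | count_b.keys())


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approx_match(x, table, max_dist=1, dist=levenshtein):
    """Hypothetical inexact lookup: index of the closest entry of `table`
    within `max_dist` of `x`, or None when no entry is close enough."""
    best_index, best_dist = None, max_dist + 1
    for index, candidate in enumerate(table):
        d = dist(x, candidate)
        if d < best_dist:
            best_index, best_dist = index, d
    return best_index


def approx_in(x, table, max_dist=1, dist=levenshtein):
    """Hypothetical inexact membership test built on approx_match."""
    return approx_match(x, table, max_dist, dist) is not None
```

With these definitions, `approx_match("lea", ["leia", "luke"])` finds the first entry because a single insertion turns "lea" into "leia", while an exact lookup would find nothing. Passing `dist=qgram_distance` swaps in the q-gram distance without changing the matching logic.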