Approximate string matching techniques for effective CLIR among Indian languages

Citations: 0
Authors
Makin, Ranbeer [1]
Pandey, Nikita [1]
Pingali, Prasad [1]
Varma, Vasudeva [1]
Affiliations
[1] Int Inst Informat Technol, Hyderabad, Andhra Pradesh, India
Source
APPLICATIONS OF FUZZY SETS THEORY | 2007 / Vol. 4578
Keywords
Telugu-Hindi CLIR; Indian languages; cognate identification
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Commonly used vocabulary in Indian-language documents found on the web contains many words of Sanskrit, Persian, or English origin. However, such words may be written in different scripts with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit the relatively large number of cognates among Indian languages, which is higher than the number shared between an Indian language and a non-Indian language. We present an approach to identify cognates and use them to improve dictionary-based CLIR when the query and the documents are in two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries, and report the improvement due to cognate recognition and translation.
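The abstract does not say which approximate string matching measure is used, so the following is only a minimal sketch of the general idea, assuming both words have already been transliterated into a common romanized form. It scores candidate cognates with a normalized Levenshtein similarity; the function names, the 0.75 threshold, and the romanized example words are illustrative assumptions, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def find_cognates(query_term, vocabulary, threshold=0.75):
    """Return (word, score) pairs at or above the threshold, best first.

    The threshold is an assumed value chosen for illustration only.
    """
    scored = [(w, similarity(query_term, w)) for w in vocabulary]
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: -p[1])


if __name__ == "__main__":
    # Hypothetical romanized forms: a Telugu query term matched against
    # a small Hindi vocabulary.
    print(find_cognates("vidyalayam", ["vidyalay", "pustak", "vidya"]))
```

In a dictionary-based CLIR setting, a query term with no dictionary entry could be passed through a matcher of this kind, and any target-language words that clear the threshold would be treated as candidate translations.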
Pages: 430 / +
Page count: 2