Imaged document text retrieval without OCR

Cited by: 1
Authors
Tan, CL [1]
Huang, WH
Yu, ZH
Xu, Y
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 117453, Singapore
[2] Agilent Technol Singapore Pte Ltd, Singapore 119967, Singapore
Keywords
document image analysis; document vector; text similarity; text retrieval
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the Vertical Traverse Density (VTD) and Horizontal Traverse Density (HTD), are extracted. An n-gram based document vector is constructed for each document from these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 database, confirms the validity of the proposed method.
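The pipeline in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the record does not confirm: a character object is a small binary grid (1 = ink), a traverse density is the number of background-to-ink transitions along each scan line, each character is reduced to a hypothetical code built from its VTD and HTD, and document vectors are normalized to unit length before the dot product. The function names are illustrative only.

```python
from collections import Counter
from math import sqrt


def traverse_density(scan_lines):
    """Count background-to-ink transitions along each scan line (assumed definition)."""
    return [
        sum(1 for prev, cur in zip([0] + list(line), line) if prev == 0 and cur == 1)
        for line in scan_lines
    ]


def vtd(image):
    """Vertical traverse density: one count per column of the binary image."""
    return traverse_density(list(zip(*image)))


def htd(image):
    """Horizontal traverse density: one count per row of the binary image."""
    return traverse_density(image)


def char_code(image):
    """Hypothetical character code: the pair of VTD and HTD profiles."""
    return (tuple(vtd(image)), tuple(htd(image)))


def ngram_vector(codes, n=3):
    """Unit-length sparse vector of n-gram counts over a sequence of character codes."""
    counts = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse document vectors."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


# Usage: two toy "documents" given as sequences of character images.
ring = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
bar = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
doc_a = [char_code(img) for img in (ring, bar, ring, bar, ring)]
doc_b = [char_code(img) for img in (bar, ring, bar, ring, bar)]
score = similarity(ngram_vector(doc_a, n=2), ngram_vector(doc_b, n=2))
```

Because the vectors are normalized, the score of a document against itself is 1 and the score of two documents with no shared n-grams is 0. Characters are compared only through their image-derived codes, so no character recognition step is involved.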
Pages: 838-844
Page count: 7