A Direct Approach for Word and Character Segmentation in Run-Length Compressed Documents with an Application to Word Spotting

Cited by: 0

Authors:

Javed, Mohammed [1]

Nagabhushan, P. [1]

Chaudhuri, B. B. [2]

Affiliations:

[1] Univ Mysore, Dept Studies Comp Sci, Mysore 570006, Karnataka, India

[2] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, Kolkata 700108, India

Source:

2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR) | 2015

Keywords:

Compressed text document segmentation; compressed word segmentation; compressed character segmentation; word spotting in compressed domain; IMAGES

DOI:

Not available

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Segmentation of a text document into lines, words and characters is an important objective in applications such as OCR and related analytics. However, documents today are commonly compressed for archival and transmission efficiency. Text segmentation in compressed documents ordinarily requires decompression, which needs additional computing resources. Against this backdrop, the paper proposes a method for text segmentation directly in run-length compressed, printed English text documents. Line segmentation is done using the projection profile technique. Further segmentation into words and characters is accomplished by tracing the white runs along the base region of the text line. During the process, a run-based region growing technique is applied in the spatial neighborhood of the white runs to trace the vertical space between the characters. After the character spaces in the entire text line are detected, each space is classified as a word space or a character space by computing the average character space. Subsequently, based on the spatial positions of the detected words and characters, their respective compressed segments are extracted. The proposed algorithm is tested on 1083 compressed text lines, and F-measures of 97.93% and 92.86% are obtained for word and character segmentation, respectively. Finally, an application to word spotting is also presented.
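
The following is a minimal Python sketch of two ideas named in the abstract: line segmentation from a projection profile computed directly on run-length data, and separating word spaces from character spaces using an average gap width. It is not the authors' implementation. The row encoding (alternating white and black run lengths, starting with white), the function names, and the rule "a gap wider than the mean gap is a word space" are all assumptions made for illustration; the abstract does not specify them, and the white-run tracing and region growing steps are not modelled here.

```python
from typing import List, Tuple


def row_black_count(runs: List[int]) -> int:
    """Black pixels in one row.

    Assumed encoding: runs alternate white, black, white, ...
    so the black runs sit at the odd indices.
    """
    return sum(runs[1::2])


def segment_lines(rle_rows: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each band of rows that contains ink.

    This is a horizontal projection profile computed without decompression:
    a row belongs to a text line when its black-pixel count is non-zero.
    """
    lines: List[Tuple[int, int]] = []
    start = None
    for y, runs in enumerate(rle_rows):
        has_ink = row_black_count(runs) > 0
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(rle_rows) - 1))
    return lines


def classify_gaps(gap_widths: List[int]) -> List[str]:
    """Label each gap in a text line as 'word' or 'char'.

    Assumed rule: gaps wider than the mean gap width are word spaces.
    """
    if not gap_widths:
        return []
    mean_gap = sum(gap_widths) / len(gap_widths)
    return ["word" if w > mean_gap else "char" for w in gap_widths]


# Hypothetical 10-pixel-wide page with two text lines.
rows = [[10], [2, 3, 5], [1, 4, 5], [10], [10], [0, 10], [10]]
text_lines = segment_lines(rows)
gap_labels = classify_gaps([2, 2, 9, 3, 2, 10])
```

For the hypothetical input above, `segment_lines` groups rows 1 to 2 and row 5 into two text lines, and `classify_gaps` marks the two wide gaps as word spaces. A fixed mean threshold is only one possible decision rule; the paper's actual criterion may differ.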

Pages: 216-220

Page count: 5
