Improving OCR text categorization accuracy with electronic abstracts

Authors
Li, Linlin [1]
Tan, Chew Lim [1]
Affiliation
[1] Natl Univ Singapore, Sch Comp, Dept Comp Sci, Singapore 117543, Singapore
Source
SECOND INTERNATIONAL CONFERENCE ON DOCUMENT IMAGE ANALYSIS FOR LIBRARIES, PROCEEDINGS | 2006
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Categorizing imaged documents is a useful technique for building digital libraries based on document images. This paper investigates techniques for improving categorization accuracy on OCR text, particularly for biomedical imaged documents. Experiments with different feature selection methods were run to explore their effect on categorization performance. The results show that document frequency is a good feature selection method for eliminating OCR errors. Furthermore, our categorization scheme IMP, which combines OCR text and electronic abstracts, shows consistent improvement in accuracy compared with categorizing on either abstracts or OCR text alone.
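The abstract names document frequency as the feature selection method that worked well against OCR errors but gives no details here. The following is a minimal sketch of generic document-frequency thresholding, not the authors' implementation: the function names, the `min_df` threshold, and the toy documents are all assumptions made for illustration. One plausible reading is that OCR misrecognitions tend to produce spellings that appear in very few documents, so a frequency cutoff drops them.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def select_by_df(docs, min_df=2):
    """Keep only terms that occur in at least min_df documents.

    min_df is a hypothetical threshold, not a value from the paper.
    """
    df = document_frequency(docs)
    vocab = {term for term, count in df.items() if count >= min_df}
    return [[t for t in tokens if t in vocab] for tokens in docs]


# Toy corpus (assumed data); "pr0tein" simulates an OCR misrecognition.
docs = [
    "protein binding in the cell".split(),
    "pr0tein binding assay".split(),
    "cell protein expression".split(),
]
filtered = select_by_df(docs, min_df=2)
```

In this toy corpus the simulated OCR error occurs in only one document, so it falls below the threshold and is removed along with other rare terms. The sketch does not cover the IMP scheme, which the abstract does not describe in enough detail to reproduce.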
Pages: 82 / +
Page count: 2