Part-of-speech tagging for table of contents recognition

Cited by: 0
Authors
Belaïd, A [1 ]
Pierron, L [1 ]
Valverde, N [1 ]
Affiliation
[1] CNRS, LORIA, F-54506 Vandoeuvre Nancy, France
Source
15TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 4, PROCEEDINGS: APPLICATIONS, ROBOTICS SYSTEMS AND ARCHITECTURES | 2000
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A labeling approach to the automatic recognition of tables of contents (TOCs) is described. A prototype is used for consulting scientific papers electronically in a library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech (POS) tagging. Tagging begins with a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Unlabeled tokens are integrated into one field or another either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily on different TOC layouts and character recognition qualities. Without manual intervention, a correct segmentation rate of 95.41% and a correct field extraction rate of 81.74% were obtained on 38 journals comprising 2703 articles.
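The labeling steps named in the abstract (dictionary-based primary labeling, a contextual rule, then assignment of tokens to fields) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionaries, the tag names (`NUM`, `INITIAL`, `FIRSTNAME`, `SURNAME`, `FUNC`, `UNK`), the single contextual rule, and the functions `primary_label`, `tag_line`, and `extract_fields` are all assumptions made for illustration, and the input is taken to be one clean TOC line with the title first, the authors next, and the page number last.

```python
import re

# Hypothetical toy dictionaries; the abstract does not list the real ones.
FIRST_NAMES = {"john", "mary", "pierre"}
FUNCTION_WORDS = {"of", "the", "for", "in", "a", "an", "and", "on"}
AUTHOR_TAGS = {"INITIAL", "FIRSTNAME", "SURNAME"}


def primary_label(token):
    """Assign a coarse tag to one token from patterns and dictionary lookups."""
    low = token.lower().strip(".,")
    if re.fullmatch(r"\d+", low):
        return "NUM"
    if re.fullmatch(r"[A-Z]\.?", token.strip(",")):
        return "INITIAL"
    if low in FIRST_NAMES:
        return "FIRSTNAME"
    if low in FUNCTION_WORDS:
        return "FUNC"
    return "UNK"


def tag_line(line):
    """Tag every token, then apply one assumed contextual rule."""
    tokens = line.split()
    tags = [primary_label(t) for t in tokens]
    # Assumed rule: a capitalized unknown token that directly follows
    # an initial or a first name is relabeled as a surname.
    for i in range(1, len(tokens)):
        follows_name = tags[i - 1] in ("INITIAL", "FIRSTNAME")
        if tags[i] == "UNK" and follows_name and tokens[i][0].isupper():
            tags[i] = "SURNAME"
    return list(zip(tokens, tags))


def extract_fields(line):
    """Group tagged tokens into title, author, and page strings."""
    fields = {"title": [], "author": [], "page": []}
    for token, tag in tag_line(line):
        if tag in AUTHOR_TAGS:
            fields["author"].append(token)
        elif tag == "NUM":
            fields["page"].append(token)
        else:
            # Function words and still-unlabeled tokens fall into the title.
            fields["title"].append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}


entry = "Part-of-speech tagging for table recognition A. Belaid, L. Pierron 451"
print(extract_fields(entry))
```

A single rule like this breaks easily on real OCR output; for instance, a title beginning with the article "A" would be tagged as an initial. Ambiguities of that kind are what the contextual correction rules and structure model mentioned in the abstract are meant to address.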
Pages: 451-454
Page count: 4