A semi-automatic system for tagging specialized corpora

Cited by: 0
Authors
Amrani, A
Kodratoff, Y
Matte-Tailliez, O
Affiliations
[1] ESIEA Rech, F-75005 Paris, France
[2] Univ Paris 11, CNRS, UMR 8623, LRI, F-91405 Orsay, France
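Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS | 2004 / Vol. 3056
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract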
This paper addresses the grammatical tagging of non-annotated specialized corpora. Existing taggers are trained on general-language corpora and give inconsistent results on specialized texts, such as technical and scientific ones. To learn rules adapted to a specialized field, the usual approach is to label a large corpus from that field manually, which is extremely time-consuming. We propose a semi-automatic approach for tagging specialized corpora. ETIQ, the new tagger we are building, makes it possible to correct the rule base obtained by Brill's tagger and to adapt it to a specialized corpus. The user views an initial, basic tagging and corrects it either by extending Brill's lexicon or by inserting specialized lexical and contextual rules. The inserted rules are richer and more flexible than Brill's. To help the expert in this task, we designed an inductive algorithm biased by the "correct" knowledge the expert has acquired beforehand. By using machine learning techniques and enabling the expert to incorporate domain knowledge in an interactive and user-friendly way, we improve the tagging of specialized corpora. Our approach has been applied to a molecular biology corpus.
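The correction workflow described in the abstract can be illustrated with a minimal sketch, which is not ETIQ itself. It assumes a simplified Brill-style pipeline: an initial tagging from a lexicon, followed by contextual rules that rewrite a tag based on the preceding tag. The lexicon entries, the rule, the default tag, and all names (`ContextualRule`, `initial_tagging`, `apply_rules`, `tag`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ContextualRule:
    """Rewrite from_tag as to_tag when the previous token carries prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def initial_tagging(tokens: List[str], lexicon: Dict[str, str],
                    default_tag: str = "NN") -> List[str]:
    # Look each word up in the lexicon; unknown words get the default tag.
    return [lexicon.get(token.lower(), default_tag) for token in tokens]


def apply_rules(tags: List[str], rules: List[ContextualRule]) -> List[str]:
    # Apply each rule in order, left to right, in the style of
    # transformation-based tagging.
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


def tag(tokens: List[str], lexicon: Dict[str, str],
        rules: List[ContextualRule]) -> List[str]:
    return apply_rules(initial_tagging(tokens, lexicon), rules)


# Hypothetical general-language lexicon (most frequent tag per word).
general_lexicon = {"the": "DT", "clones": "VBZ", "were": "VBD"}

# Hypothetical expert corrections: a lexicon extension and a contextual rule.
domain_lexicon = {"transfected": "VBN"}
domain_rules = [ContextualRule(from_tag="VBZ", to_tag="NNS", prev_tag="DT")]

tokens = ["the", "clones", "were", "transfected"]
baseline = tag(tokens, general_lexicon, [])
adapted = tag(tokens, {**general_lexicon, **domain_lexicon}, domain_rules)
print(baseline)
print(adapted)
```

In this toy setting, the expert's two corrections change the tags of "clones" and "transfected" without retraining on a manually labeled corpus, which is the kind of adaptation the abstract describes.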
Pages: 670-681
Page count: 12