Boosting text segmentation via progressive classification

Cited by: 17
Authors
Cesario, Eugenio [1]
Folino, Francesco [1]
Locane, Antonio [1]
Manco, Giuseppe [1]
Ortale, Riccardo [1]
Affiliations
[1] CNR, ICAR CNR, Inst High Performance Comp & Networks, I-87036 Arcavacata Di Rende, Italy
Keywords
schema reconciliation; text segmentation; classification
DOI
10.1007/s10115-007-0085-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A novel approach for reconciling tuples stored as free text into an existing attribute schema is proposed. The basic idea is to subject the available text to progressive classification, i.e., a multi-stage classification scheme where, at each intermediate stage, a classifier is learnt that analyzes the textual fragments not reconciled at the end of the previous steps. Classification is accomplished by an ad hoc exploitation of traditional association mining algorithms, and is supported by a data transformation scheme which takes advantage of domain-specific dictionaries/ontologies. A key feature is the capability of progressively enriching the available ontology with the results of the previous stages of classification, thus significantly improving the overall classification accuracy. An extensive experimental evaluation shows the effectiveness of our approach.
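The multi-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper learns its classifiers with association mining, whereas this sketch swaps in two hand-written toy rules, and every name in it (`dictionary_stage`, `context_stage`, `progressive_classify`, the attribute names, the sample tokens) is a hypothetical chosen for illustration. It shows only the control flow: each stage sees the tokens left unlabeled by earlier stages, and every new label is written back into the dictionary.

```python
from typing import Callable, Dict, List, Optional

Labels = Dict[int, str]       # token position -> attribute name
Dictionary = Dict[str, str]   # lower-cased token -> attribute name
Stage = Callable[[List[str], int, Labels, Dictionary], Optional[str]]


def dictionary_stage(tokens: List[str], i: int, labels: Labels,
                     dictionary: Dictionary) -> Optional[str]:
    """Toy stage 1: label a token only if the dictionary already knows it."""
    return dictionary.get(tokens[i].lower())


def context_stage(tokens: List[str], i: int, labels: Labels,
                  dictionary: Dictionary) -> Optional[str]:
    """Toy stage 2: a token right before a street type is a street name."""
    if labels.get(i + 1) == "street_type":
        return "street_name"
    return None


def progressive_classify(tokens: List[str], stages: List[Stage],
                         dictionary: Dictionary) -> List[Optional[str]]:
    """Run the stages in order; each one sees only still-unlabeled tokens."""
    labels: Labels = {}
    pending = list(range(len(tokens)))
    for stage in stages:
        unresolved = []
        for i in pending:
            label = stage(tokens, i, labels, dictionary)
            if label is None:
                unresolved.append(i)
            else:
                labels[i] = label
                # Enrich the dictionary with what this stage decided.
                dictionary.setdefault(tokens[i].lower(), label)
        pending = unresolved
    # Tokens that no stage could reconcile stay None.
    return [labels.get(i) for i in range(len(tokens))]


# Assumed starting dictionary; the paper uses domain-specific ontologies.
dictionary = {"street": "street_type", "rende": "city"}
stages = [dictionary_stage, context_stage]

first = progressive_classify(["Bucci", "Street", "Rende"], stages, dictionary)
second = progressive_classify(["Bucci", "Road"], stages, dictionary)
print(first)
print(second)
```

In the first record, "Bucci" is resolved only at the second stage, from its context. In the second record it is resolved at the first stage, because the earlier result was added to the dictionary; "Road" remains unreconciled since neither toy rule covers it.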
Pages: 285-320
Page count: 36