Segmented document classification: Problem and solution

Cited by: 0
Authors
Guo, Hang [1]
Zhou, Lizhu [1]
Affiliation
[1] Tsinghua Univ, Comp Sci & Technol Dept, Beijing 100084, Peoples R China
Source
DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS | 2006 / Vol. 4080
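Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract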
In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are segmented into sections such as "title" and "body"; we call them "segmented documents". To classify segmented documents, we can treat them as bags of words and use well-developed text classification models. However, different sections of a segmented document may affect the classification result differently, so it is better to treat them differently in the classification process. Following this idea, we design two algorithms, IN-MIX and OUT-MIX, that label segmented documents using a trained classifier. We evaluate the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of the different classification models improves compared with the conventional bag-of-words method.
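The abstract does not define IN-MIX or OUT-MIX, so the following is not the authors' method. It is only a minimal sketch of the general idea that sections can contribute unequally to a bag-of-words representation. The function name `weighted_bag_of_words`, the sample document, and the section weights are all hypothetical and chosen for illustration.

```python
from collections import Counter


def weighted_bag_of_words(doc, weights, default_weight=1.0):
    """Merge per-section term counts into a single weighted bag of words.

    doc: mapping of section name -> section text (hypothetical layout).
    weights: mapping of section name -> weight (illustrative values,
             not taken from the paper).
    """
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, default_weight)
        # Naive whitespace tokenization; a real system would do more.
        for token in text.lower().split():
            bag[token] += w
    return bag


# Hypothetical segmented document with "title" and "body" sections.
doc = {
    "title": "Wheat exports rise",
    "body": "Exports of wheat rose sharply this year",
}
# Assumed weights: title terms count three times as much as body terms.
weights = {"title": 3.0, "body": 1.0}

bag = weighted_bag_of_words(doc, weights)
```

Setting every weight to 1.0 reduces this to the conventional bag-of-words representation that the abstract uses as its baseline. The resulting weighted counts could be passed to any standard classifier as features.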
Pages: 538-548
Number of pages: 11