Automatic extraction of titles from general documents using machine learning

Cited by: 21
Authors
Hu, Yunhua
Li, Hang
Cao, Yunbo
Teng, Li
Meyerzon, Dmitriy
Zheng, Qinghua
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci, Xian 710049, Shaanxi, Peoples R China
[2] Microsoft Res Asia, Sigma Ctr 5F, Beijing 100080, Peoples R China
[3] Chinese Univ Hong Kong, Shatin, Hong Kong, Peoples R China
[4] Microsoft Corp, Redmond, WA 98052 USA
Keywords
information extraction; metadata extraction; machine learning; search
DOI
10.1016/j.ipm.2005.12.001
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information, such as font size, as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively. Other important new findings in this work include that we can train models in one domain and apply them to other domains, and, more surprisingly, that we can even train models in one language and apply them to other languages. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles. © 2005 Elsevier Ltd. All rights reserved.
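The pipeline described in the abstract (annotate titles, train a model on formatting features, extract with the trained model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract names font size as one feature but does not list the full feature set or the model, so the `Line` record, the bold and centered features, and the plain perceptron used here are hypothetical stand-ins, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One text line of a document with its formatting (hypothetical record)."""
    text: str
    font_size: float
    bold: bool = False
    centered: bool = False
    is_title: bool = False  # human annotation, used only for training


def features(line: Line, max_font: float) -> List[float]:
    # Formatting-only features: font size relative to the largest font in the
    # document, bold flag, centered flag, and a constant bias term.
    return [
        line.font_size / max_font,
        1.0 if line.bold else 0.0,
        1.0 if line.centered else 0.0,
        1.0,
    ]


def score(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(docs: List[List[Line]], epochs: int = 20) -> List[float]:
    # Plain perceptron (an assumed model): update the weights whenever a line
    # is scored on the wrong side of zero.
    w = [0.0, 0.0, 0.0, 0.0]
    for _ in range(epochs):
        for doc in docs:
            max_font = max(l.font_size for l in doc)
            for line in doc:
                x = features(line, max_font)
                y = 1.0 if line.is_title else -1.0
                if y * score(w, x) <= 0:
                    w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def extract_title(doc: List[Line], w: List[float]) -> str:
    # Return the text of the highest-scoring line as the extracted title.
    max_font = max(l.font_size for l in doc)
    best = max(doc, key=lambda l: score(w, features(l, max_font)))
    return best.text


# Toy annotated training documents (invented data, not from the paper).
TRAIN_DOCS = [
    [
        Line("Quarterly Report", 24, bold=True, centered=True, is_title=True),
        Line("Prepared by the finance team", 12),
        Line("Revenue grew in the third quarter.", 12),
    ],
    [
        Line("Project Plan", 20, bold=True, is_title=True),
        Line("Overview of milestones", 10),
    ],
]

weights = train_perceptron(TRAIN_DOCS)
new_doc = [
    Line("Internal memo", 11),
    Line("Budget Review 2005", 18, bold=True, centered=True),
    Line("The budget was reviewed last week.", 11),
]
print(extract_title(new_doc, weights))
```

Because the features describe only formatting and never the words themselves, a model of this kind does not depend on vocabulary, which is consistent with the abstract's finding that models trained in one language can be applied to another.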
Pages: 1276-1293
Page count: 18