Newspaper Article Extraction Using Hierarchical Fixed Point Model

Cited by: 8
Authors
Bansal, Anukriti [1]
Chaudhury, Santanu [1]
Roy, Sumantra Dutta [1]
Srivastava, J. B. [2]
Affiliations
[1] Indian Inst Technol Delhi, Dept Elect Engn, New Delhi, India
[2] Indian Inst Technol Delhi, Dept Math, New Delhi, India
Source
2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2014) | 2014
Keywords
DOCUMENT STRUCTURE; IMAGES
DOI
10.1109/DAS.2014.42
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a novel learning-based framework for extracting articles from newspaper images using a fixed-point model. The input to the system comprises blocks of text and graphics obtained using standard image processing techniques. The fixed-point model uses contextual information and the features of each block to learn the layout of newspaper images, and attains a contraction mapping that assigns a unique label to every block. We use a hierarchical model that works in two stages. In the first stage, a semantic label (heading, sub-heading, text block, image, or caption) is assigned to each segmented block. These labels are then used as input to the second stage, which groups related blocks into news articles. Experimental results show the applicability of our algorithm to newspaper labeling and article extraction.
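The first-stage idea in the abstract (each block's label depends on its own features and on the labels of neighbouring blocks, iterated until nothing changes) can be illustrated with a small sketch. This is not the authors' learned model: the function `fixed_point_labels`, the hand-written `local_scores`, the `neighbors` adjacency list, and the coupling weight `alpha` are all hypothetical, and a softmax update stands in for whatever trained predictor the paper uses. For this particular update, the map is a contraction in the max norm whenever `alpha < 2`, so the iteration converges to the same labelling from any starting point.

```python
import math

# Label set taken from the abstract; everything else below is illustrative.
LABELS = ["heading", "sub-heading", "text-block", "image", "caption"]


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_point_labels(local_scores, neighbors, alpha=0.5,
                       tol=1e-9, max_iter=200, init=None):
    """Iterate q <- f(features, neighbours' q) until q stops changing.

    local_scores[i][c]: hypothetical feature-based score of label c for block i.
    neighbors[i]: indices of the blocks adjacent to block i.
    alpha: assumed weight of the contextual (neighbour) term.
    Returns the label distributions and the number of iterations used.
    """
    n, k = len(local_scores), len(local_scores[0])
    q = [list(row) for row in init] if init else [[1.0 / k] * k for _ in range(n)]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_q = []
        for i in range(n):
            # Context term: average label distribution of the neighbours.
            ctx = [0.0] * k
            for j in neighbors[i]:
                for c in range(k):
                    ctx[c] += q[j][c] / len(neighbors[i])
            new_q.append(softmax([local_scores[i][c] + alpha * ctx[c]
                                  for c in range(k)]))
        # Largest change in any label probability during this sweep.
        delta = max(abs(new_q[i][c] - q[i][c])
                    for i in range(n) for c in range(k))
        q = new_q
        if delta < tol:
            break
    return q, iteration


# Five made-up blocks laid out in a chain: heading, two text blocks,
# an image, and its caption.
local_scores = [
    [2.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 0.3],
    [0.0, 0.2, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.8, 0.0, 1.5],
]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]

q, iters = fixed_point_labels(local_scores, neighbors)
predicted = [LABELS[row.index(max(row))] for row in q]
print(predicted, "after", iters, "iterations")
```

Under the same assumptions, the second stage described in the abstract could reuse this loop with article-membership labels in place of semantic labels, taking the first-stage output as part of each block's features.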
Pages: 257-261
Page count: 5