Text segmentation based on document understanding for information retrieval

Cited by: 0
Authors
Prince, Violaine [1]
Labadie, Alexandre [1]
Affiliations
[1] LIRMM, 161 Rue Ada, F-34392 Montpellier 5, France
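Source
NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, PROCEEDINGS | 2007 / Vol. 4592
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract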
Information retrieval needs to match relevant texts with a given query. Selecting appropriate parts is useful when documents are long and only portions are of interest to the user. In this paper, we describe a method that extensively uses natural language techniques for text segmentation based on topic change detection. The method requires an NLP parser and a semantic representation in Roget-based vectors. We have run the experiment on French documents, for which we have the appropriate tools, but the method could be transposed to any other language with the same requirements. The article sketches an overview of the functionalities of the NL understanding environment and the algorithms related to our text segmentation method. An experiment in text segmentation is also presented, and its result in an information retrieval task is shown.
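The abstract does not give the segmentation algorithm itself, so the following is only a minimal sketch of the general idea of topic change detection over semantic vectors, not the authors' method. It assumes each sentence has already been mapped to a numeric vector (in the paper this would come from the NLP parser and the Roget-based representation; here the vectors are hand-made toy values), and it places a boundary wherever the cosine similarity between adjacent sentences drops below a threshold. The names `cosine`, `topic_boundaries`, `split_segments` and the threshold value 0.5 are hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_boundaries(sentence_vectors, threshold=0.5):
    """Return every index i at which a new segment starts.

    A boundary is placed before sentence i when its vector is less similar
    than `threshold` to the vector of sentence i - 1 (an assumed criterion).
    """
    boundaries = []
    for i in range(1, len(sentence_vectors)):
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold:
            boundaries.append(i)
    return boundaries


def split_segments(sentences, sentence_vectors, threshold=0.5):
    """Group sentences into contiguous segments using the detected boundaries."""
    cuts = [0] + topic_boundaries(sentence_vectors, threshold) + [len(sentences)]
    return [sentences[a:b] for a, b in zip(cuts, cuts[1:])]


# Toy data: two sentences on one topic, then two on another (made-up vectors).
sentences = ["s0", "s1", "s2", "s3"]
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
segments = split_segments(sentences, vectors)
print(segments)
```

In a retrieval setting, the resulting segments rather than whole documents could then be matched against the query, which is the use case the abstract motivates.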
Pages: 295 / +
Page count: 3