Extracting subject from internet news by string match

Cited by: 0

Authors:

Yin, Zhong-Hang [1]

Wang, Yong-Cheng [1]

Cai, Wei [1]

Han, Ke-Song [1]

Affiliation:

[1] Sch. of Electron. and Info. Technol., Shanghai Jiaotong Univ., Shanghai 200030, China

Source:

Ruan Jian Xue Bao/Journal of Software | 2002 / Vol. 13 / No. 02

Keywords:

Internet news; Repeated strings; String matching; Subject extraction; Web information processing

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Subject extraction from text is an important task in natural language processing. Traditional methods rely mainly on matching the text against a thesaurus. This approach is poorly suited to Internet news, because a thesaurus has limited coverage and is slow to update. After a careful analysis of news structure, this paper presents a practical method for extracting news subjects without a thesaurus and describes its main implementation procedure. Instead of a large thesaurus, the method exploits the special structure of Internet news to find repeated strings, which express the news subjects very well. Experimental results show that the method extracts the most important subject strings from most Internet news items rapidly and efficiently. Moreover, the method is equally applicable to other Asian languages such as Japanese and Korean, as well as to Western languages.
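
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of thesaurus-free repeated-string extraction, not the authors' procedure. The function name `repeated_strings`, its parameters (`min_len`, `max_len`, `min_count`), and the ranking by length times frequency are all assumptions made for illustration. It counts character substrings, keeps those that occur more than once, and discards any string that is contained equally often in a longer repeated string.

```python
from collections import Counter


def repeated_strings(text, min_len=2, max_len=8, min_count=2):
    """Return (substring, count) pairs for strings repeated in `text`.

    Hypothetical helper for illustration only; the paper's own
    procedure and thresholds are not given in the abstract.
    """
    counts = Counter()
    # Count every character substring of length min_len..max_len
    # (overlapping occurrences included), skipping any that span whitespace.
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            piece = text[i:i + n]
            if not any(ch.isspace() for ch in piece):
                counts[piece] += 1

    repeated = {s: c for s, c in counts.items() if c >= min_count}

    # Keep only maximal strings: drop s when a longer repeated string
    # contains it and occurs just as often.
    maximal = {
        s: c
        for s, c in repeated.items()
        if not any(s != t and s in t and c == repeated[t] for t in repeated)
    }

    # Assumed ranking: longer and more frequent strings first.
    return sorted(maximal.items(), key=lambda kv: (-len(kv[0]) * kv[1], kv[0]))


if __name__ == "__main__":
    news = "上海交通大学发布新闻。上海交通大学表示，新闻属实。"
    print(repeated_strings(news))
    # [('上海交通大学', 2), ('新闻', 2)]
```

Because the sketch works on raw characters rather than on dictionary words, it needs no word segmentation, which is consistent with the abstract's claim that the approach carries over to Japanese, Korean, and Western languages.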

Citation

Pages: 159-167