Content Extraction of Biological Datasets Using Soft Computing Techniques

Cited by: 4

Authors:

Prakash, Kolla Bhanu [1, 2]

Rangaswamy, M. A. Dorai [1]

Affiliations:

[1] Sathyabama Univ, Fac Comp Sci Engn, Madras 600119, Tamil Nadu, India

[2] Chirala Engn Coll, Fac Comp, Chirala 523157, India

Source:

JOURNAL OF MEDICAL IMAGING AND HEALTH INFORMATICS | 2016 / Vol. 6 / Issue 04

Keywords:

Content Extraction; Biology; Attribute; Multilingual; Pattern;

DOI:

10.1166/jmihi.2016.1931

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

Content extraction and identification for biological datasets are gaining prominence. Because many biological datasets are available online, it is difficult to identify or extract content from similar datasets, and the difficulty is greater for multilingual web documents. Content extraction is the process of identifying the main content of a web page, which may contain different forms of data in an unstructured and non-homogeneous manner. The present study attempts to develop a pixel-based approach, which gives flexibility in dealing with any language or media, starting from the generic text level and extending to a hybrid unstructured level. The proposed technique is purely data-driven: it does not use domain-dependent background information, nor does it rely on predefined document categories or a given list of topics. The model is tested with different attribute inputs, and a minimum of a 2 x 2 attribute is found to be required to assess the content. However, testing with several biological datasets shows that a 3 x 3 attribute gives better results for analysis and content extraction. The approach is later tested with words from other languages to form a more elaborate base set.

Pages: 932-936

Page count: 5