Content Extraction of Biological Datasets Using Soft Computing Techniques

Cited by: 4
Authors
Prakash, Kolla Bhanu [1, 2]
Rangaswamy, M. A. Dorai [1]
Affiliations
[1] Sathyabama Univ, Fac Comp Sci Engn, Madras 600119, Tamil Nadu, India
[2] Chirala Engn Coll, Fac Comp, Chirala 523157, India
Keywords
Content Extraction; Biology; Attribute; Multilingual; Pattern;
DOI
10.1166/jmihi.2016.1931
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09;
Abstract
Content extraction and identification of biological datasets are gaining prominence. Because many biological datasets are available online, it becomes difficult to identify or extract content from similar datasets, and this is especially difficult for multilingual web documents. Content extraction is the process of identifying the main content of a web page, which may consist of different forms of data in an unstructured and non-homogeneous manner. The present study attempts to develop a pixel-based approach, which gives flexibility in dealing with any language or medium, starting from the generic text level and extending to a hybrid unstructured level. The proposed technique is purely data driven: it does not make use of domain-dependent background information, nor does it rely on predefined document categories or a given list of topics. The model is tested with different attribute inputs, and a minimum of a 2 x 2 attribute is found to be required to assess the content. After testing with several biological datasets, however, a 3 x 3 attribute is found to give better results for analysis and content extraction. This is later tested with words from other languages to form a more elaborate base set.
Pages: 932-936
Number of pages: 5