Automatic Regular Expression Generation for Extracting Relevant Image Data From Web Pages Using Genetic Algorithms

Cited by: 0
Authors
Aslanyurek, Canan [1]
Yerlikaya, Tarik [2]
Affiliations
[1] Kirklareli Univ, TR-39100 Kirklareli, Turkiye
[2] Trakya Univ, TR-22030 Edirne, Turkiye
Source
IEEE ACCESS | 2024 / Volume 12
Keywords
Data mining; Genetic algorithms; Web pages; Feature extraction; Statistics; Sociology; Manuals; Image reconstruction; automatic regular expressions; web data extraction; image data extraction
DOI
10.1109/ACCESS.2024.3420734
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This study presents a method that uses genetic algorithms to automatically generate regular expressions for extracting relevant images from web pages. Data extraction is usually done with web scrapers, but it can also be done with regular expressions. Regular expressions are complex, however, and writing them requires expert knowledge. The proposed method automatically creates a regular expression that retrieves the images relevant to the news content on a website. Following the principle of genetic algorithms, in which good candidates survive and bad ones are eliminated, it produces a regular expression that reaches the most relevant image. Automatic generation with genetic algorithms can therefore replace the time-consuming and error-prone task of building an appropriate pattern for each site with web scraper tools. The study used a data set of text-based relevant and irrelevant image data collected from 200 websites in 58 countries. The data set contains 635,015 image records, of which 22,682 are relevant. Regular expressions generated by the genetic algorithm from the relevant image data alone retrieve the relevant images at a rate of approximately 98.49%.
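The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea (evolving a regular expression with a genetic algorithm), not the authors' implementation. The sample `<img>` tags, the fragment alphabet, the F1-based fitness, and every parameter value are assumptions made for illustration. Unlike the method in the abstract, which looks only at relevant image data, this toy fitness also uses irrelevant examples.

```python
import random
import re

# Hypothetical labeled samples: (img tag text, is_relevant).
SAMPLES = [
    ('<img class="article-photo" src="/news/2024/a.jpg">', True),
    ('<img class="article-photo" src="/news/2024/b.png">', True),
    ('<img class="logo" src="/static/logo.png">', False),
    ('<img class="ad-banner" src="/ads/x.gif">', False),
    ('<img class="icon" src="/static/rss.png">', False),
]

# Assumed gene alphabet: each gene is a complete regex fragment,
# so any concatenation of genes is a syntactically valid pattern.
FRAGMENTS = ['<img', 'class="', 'article', 'photo', 'logo', 'src="',
             '/news/', '/static/', r'[^"]*', r'\w+', '.*?',
             r'\.jpg', r'\.png', '-', '"', ' ']


def to_regex(genes):
    """Join a list of fragments into one pattern string."""
    return ''.join(genes)


def fitness(genes):
    """F1 score of the pattern over SAMPLES (0.0 if it matches nothing relevant)."""
    try:
        pattern = re.compile(to_regex(genes))
    except re.error:  # defensive; should not happen with this alphabet
        return 0.0
    tp = fp = fn = 0
    for text, relevant in SAMPLES:
        hit = pattern.search(text) is not None
        if hit and relevant:
            tp += 1
        elif hit:
            fp += 1
        elif relevant:
            fn += 1
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def evolve(generations=40, pop_size=30, max_len=6, seed=0):
    """Evolve gene lists with elitism, tournament selection,
    one-point crossover, and single-gene mutation."""
    rng = random.Random(seed)
    pop = [[rng.choice(FRAGMENTS) for _ in range(rng.randint(1, max_len))]
           for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        nxt = pop[:2]  # elitism: the two best survive unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 3), key=fitness)  # tournament of 3
            b = max(rng.sample(pop, 3), key=fitness)
            child = a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
            child = child[:max_len] or [rng.choice(FRAGMENTS)]
            if rng.random() < 0.3:  # mutation: replace one gene
                child[rng.randrange(len(child))] = rng.choice(FRAGMENTS)
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return to_regex(best), fitness(best), history


best_regex, best_score, history = evolve()
print(best_regex, best_score)
```

Because the two fittest individuals are copied unchanged into each new generation, the best fitness never decreases from one generation to the next. A realistic system would need a far richer fragment alphabet and a much larger data set than this toy example.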
Pages: 90660-90669
Page count: 10