Figure search by text in large scale digital document collections

Cited by: 3
Authors
Yurtsever, M. Mucahit Enes [1]
Ozcan, Muhammet [2]
Taruz, Zubeyir [2]
Eken, Suleyman [1]
Sayar, Ahmet [2]
Affiliations
[1] Kocaeli Univ, Dept Informat Syst Engn, Umuttepe Campus, TR-41001 Kocaeli, Turkey
[2] Kocaeli Univ, Dept Comp Engn, Kocaeli, Turkey
Keywords
Apache Solr; document digitization; Elasticsearch; figure search; full-text search; regular expressions; RETRIEVAL
DOI
10.1002/cpe.6529
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Digital document collections have been created by transferring large numbers of documents to digital media, and these archives have provided many benefits to users. As the diversity and size of digital image collections have grown exponentially, retrieving a desired image from them has become both more important and more difficult. The images in a document may contain critical information about its subject. In this study, an architecture is developed that works on large-scale data by combining regular expressions with full-text search approaches. The performance of the system was tested on different academic documents, and the insert times of Elasticsearch and Apache Solr were compared. Compared to Elasticsearch, Apache Solr achieved faster and more successful results.
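
The abstract does not spell out the regular expressions or the indexing setup, so the following is only a minimal sketch of the general idea of pairing caption-matching regular expressions with text search. The caption pattern, the function names (`extract_captions`, `build_index`, `search`) and the in-memory inverted index are all assumptions made for illustration; the index stands in for a search engine such as Elasticsearch or Apache Solr and is not the authors' implementation.

```python
import re
from collections import defaultdict

# Assumed caption pattern (not taken from the paper): a line that starts with
# "Figure 3:", "Fig. 2." or "FIGURE 10 -", followed by the caption text.
CAPTION_RE = re.compile(
    r"^(?:Figure|Fig\.?)\s+(\d+)\s*[:.\-]\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_captions(doc_id, text):
    """Return (doc_id, figure_number, caption) tuples found in the text."""
    return [
        (doc_id, int(m.group(1)), m.group(2).strip())
        for m in CAPTION_RE.finditer(text)
    ]


def build_index(captions):
    """Map each lowercase caption token to the figures whose caption contains it."""
    index = defaultdict(set)
    for doc_id, number, caption in captions:
        for token in re.findall(r"[a-z0-9]+", caption.lower()):
            index[token].add((doc_id, number))
    return index


def search(index, query):
    """Return figures whose captions contain every query token (AND semantics)."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical text, as it might come out of a PDF text extractor.
sample = (
    "Intro text.\n"
    "Figure 1: System architecture overview\n"
    "More text.\n"
    "Fig. 2. Insert time comparison\n"
)
captions = extract_captions("paper-1", sample)
index = build_index(captions)
```

Here `search(index, "insert time")` returns the second figure of `paper-1`. In a large-scale setting the extracted caption records would be sent to a dedicated full-text engine rather than held in a Python dictionary.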
Pages: 11