Automatic detection and extraction of key resources from tables in biomedical papers

Authors
Ozyurt, Ibrahim Burak [1]
Bandrowski, Anita [1]
Affiliations
[1] UCSD, FDI Lab Dept Neurosci, 9500 Gilman Dr M-C 0608, La Jolla, CA 92093 USA
Source
BIODATA MINING | 2025 / Vol. 18 / Issue 01
Funding
National Institutes of Health (NIH), USA;
Keywords
Table extraction; Scientific reproducibility; Information extraction; Natural language processing; Language modeling; Bioinformatics; STRUCTURE RECOGNITION;
DOI
10.1186/s13040-025-00438-9
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Background: Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. STAR*Methods key resource tables have increased the "findability" of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these "resource table candidates" automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.
Methods: We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, and "Table Transformer" models for table detection and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables, pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data generated with a novel approach to alleviate row over-segmentation, significantly improving key resource extraction performance.
Results: The extraction of key resource tables in PDF files by the popular GROBID tool resulted in a Grid Table Similarity (GriTS) score of 0.12. All of our pipelines outperformed GROBID by a large margin. Our best pipeline, with the table-specific language model-based row merger, achieved a GriTS score of 0.90.
Conclusions: Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on BioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, and annotated training and evaluation data are publicly available.
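To make the row over-segmentation problem named in the Methods concrete, the sketch below shows what an over-segmented table looks like and how merging rows repairs it. This is not the authors' method: the paper uses a fine-tuned table-specific language model to decide which rows to merge, whereas this sketch substitutes a naive rule (a row with an empty first cell continues the previous row). The function name, the `key_col` parameter, and the sample data are all hypothetical, and the sketch assumes every row has the same number of columns.

```python
from typing import List

Row = List[str]


def merge_oversegmented_rows(rows: List[Row], key_col: int = 0) -> List[Row]:
    """Naive stand-in for a learned row merger (illustrative assumption only).

    A row whose key column is empty is treated as a continuation of the
    previous row, and its non-empty cells are appended to the matching
    cells of that row. Assumes all rows have the same number of columns.
    """
    merged: List[Row] = []
    for row in rows:
        is_continuation = bool(merged) and not row[key_col].strip()
        if is_continuation:
            prev = merged[-1]
            for i, cell in enumerate(row):
                text = cell.strip()
                if text:
                    prev[i] = f"{prev[i]} {text}".strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


# Hypothetical key resource table in which structure recognition split
# one logical row into two physical rows (placeholder identifiers).
rows = [
    ["Anti-GFP antibody", "ExampleVendor", "Cat# 0000"],
    ["", "", "RRID:AB_0000000"],
    ["HEK293 cells", "ExampleRepo", "Cat# 1111"],
]

for merged_row in merge_oversegmented_rows(rows):
    print(merged_row)
```

A rule like this fails whenever a genuine row legitimately has an empty first cell, which is one reason a learned merger, as described in the abstract, can be preferable.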
Pages: 18