Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry

Cited by: 1
Authors
Cordeiro, Fabio Correa [1]
da Silva, Patricia Ferreira [2]
Tessarollo, Alexandre [2]
Freitas, Claudia [3, 4]
de Souza, Elvis [3]
Gomes, Diogo da Silva Magalhaes [2]
Souza, Renato Rocha [1]
Coelho, Flavio Codeco [1]
Affiliations
[1] Getulio Vargas Fdn, Praia Botafogo 190, BR-22250900 Rio De Janeiro, Brazil
[2] Petrobras Res & Dev Ctr CENPES, Ave Horacio Macedo 950, BR-21941915 Rio De Janeiro, Brazil
[3] Pontificia Univ Catolica Rio de Janeiro, Rua Marques Sao Vicente 225, BR-22451900 Rio de Janeiro, Brazil
[4] ICMC USP, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, Brazil
Keywords
Natural language processing; Information extraction; Ontology; Knowledge graphs; Linguistic corpora
DOI
10.1016/j.cageo.2024.105714
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents Petro NLP, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The Petro NLP resources comprise: (i) Petro KGraph, a knowledge graph populated with entities and relations commonly found in technical reports; and (ii) Petrolês, PetroGold, PetroNER, and PetroRE, sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.
Pages: 13