Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry

Times cited: 1
Authors
Cordeiro, Fabio Correa [1]
da Silva, Patricia Ferreira [2]
Tessarollo, Alexandre [2]
Freitas, Claudia [3, 4]
de Souza, Elvis [3]
Gomes, Diogo da Silva Magalhaes [2]
Souza, Renato Rocha [1]
Coelho, Flavio Codeco [1]
Affiliations
[1] Getulio Vargas Fdn, Praia Botafogo 190, BR-22250900 Rio De Janeiro, Brazil
[2] Petrobras Res & Dev Ctr CENPES, Ave Horacio Macedo 950, BR-21941915 Rio De Janeiro, Brazil
[3] Pontificia Univ Catolica Rio de Janeiro, Rua Marques Sao Vicente 225, BR-22451900 Rio de Janeiro, Brazil
[4] ICMC USP, Ave Trabalhador Sao Carlense 400, BR-13566590 Sao Carlos, Brazil
Keywords
Natural language processing; Information extraction; Ontology; Knowledge graphs; Linguistic corpora
DOI
10.1016/j.cageo.2024.105714
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents Petro NLP, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The Petro NLP resources comprise: (i) Petro KGraph, a knowledge graph populated with entities and relations commonly found in technical reports; and (ii) Petrolês, PetroGold, PetroNER, and PetroRE, sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.
Pages: 13