Automatic extraction of significant terms from the title and abstract of scientific papers using the machine learning algorithm: A multiple module approach

Cited by: 1
Authors
Mukherjee, Bhaskar [1]
Majhi, Debasis [1]
Affiliations
[1] Banaras Hindu Univ, Dept Lib & Informat Sci, Varanasi, India
Keywords
Data mining; Title extraction; Natural Language Processing; YAKE; NLTK; Keyword Extraction-NLP
DOI
10.56042/alis.v70i1.71272
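Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract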
Keyword extraction is the task of identifying the important terms or phrases that are most representative of a source document. Although automatic extraction of keywords from titles is an old method, it has mainly been applied to single web documents. Our approach differs from previous research on keyword extraction in several respects. For those who are not experts in a scientific field, understanding its research trends is difficult. The purpose of this study is to develop an automatic method that gives non-experts an overview of a scientific field by capturing research trends. This empirical study explores significant term extraction using Natural Language Processing (NLP) tools. Our dataset consisted of more than 15,000 titles saved in a .csv file, and scripts written in Python were used to compare how far the significant terms of the scientific title corpus are similar to or different from the terms in the abstracts of the same article corpus. A lightweight unsupervised extractor, Yet Another Keyword Extractor (YAKE), was used to obtain the results. Based on our analysis, it can be concluded that these algorithms can also be used in other fields, by non-experts in those fields, to extract significant words automatically and to understand trends. Our algorithm could help reduce the labour-intensive manual indexing process.
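The comparison described in the abstract (significant terms of a title against those of the matching abstract) can be sketched roughly as follows. This is a minimal illustration, not the authors' scripts: the frequency-based `significant_terms` function is a plain stand-in for YAKE, the Jaccard overlap is an assumed similarity measure, and the column names `title` and `abstract` as well as the sample row are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
             "of", "on", "the", "to", "using", "with"}

def significant_terms(text, top_n=5):
    """Frequency-based stand-in for a keyword extractor such as YAKE."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return {term for term, _ in counts.most_common(top_n)}

def jaccard(a, b):
    """Share of terms common to both sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def compare_corpus(csv_file):
    """Yield (title, overlap) per row; 'title' and 'abstract' are hypothetical column names."""
    for row in csv.DictReader(csv_file):
        title_terms = significant_terms(row["title"])
        abstract_terms = significant_terms(row["abstract"])
        yield row["title"], jaccard(title_terms, abstract_terms)

# Hypothetical one-row sample standing in for the .csv dataset.
SAMPLE = io.StringIO(
    "title,abstract\n"
    '"Keyword extraction from scientific titles",'
    '"Keyword extraction identifies terms in scientific titles."\n'
)
for title, score in compare_corpus(SAMPLE):
    print(f"{score:.2f}  {title}")
```

An overlap near 1 would suggest that the title terms largely reappear among the abstract terms, while a value near 0 would suggest that the two carry different vocabulary. Swapping in a real extractor only requires replacing `significant_terms`.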
Pages: 33-40
Number of pages: 8