An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus

Cited by: 4
Authors
Chen, Liang-Ching [1]
Affiliation
[1] ROC Mil Acad, Dept Foreign Languages, Kaohsiung 830, Taiwan
Keywords
Keyword extraction; Natural Language Processing (NLP); Corpus linguistic; Dunning's Log-Likelihood Test (LLT); Extended Term Frequency-Inverse Document Frequency (TF-IDF) method; Climate change
DOI
10.1016/j.datak.2024.102322
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning's Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.
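To make the contrast in the abstract concrete, the following is a minimal sketch of the two baseline statistics it names: a log-likelihood keyness score that compares a target corpus against a benchmark corpus, and a plain TF-IDF score computed across sub-corpora with no benchmark. It does not reproduce the paper's extended TF-IDF method, which the abstract does not specify. The function names, the two-term form of the log-likelihood statistic, the relative-frequency TF and unsmoothed logarithmic IDF weighting, and the toy token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (two-term form, an assumption of this sketch).

    a: frequency of the word in the target corpus
    b: frequency of the word in the benchmark corpus
    c: total tokens in the target corpus
    d: total tokens in the benchmark corpus
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def tf_idf(docs):
    """Plain TF-IDF over a list of tokenized documents (sub-corpora).

    Returns one {term: score} dict per document. TF is relative frequency;
    IDF is log(N / df) without smoothing (illustrative choices).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (tf[t] / total) * math.log(n / df[t]) for t in tf} if total else {}
        )
    return scores


# Hypothetical toy sub-corpora; real input would be tokenized articles.
docs = [
    ["the", "climate", "change"],
    ["the", "carbon", "the"],
    ["the", "climate"],
]
scores = tf_idf(docs)
```

In this sketch a word that occurs in every sub-corpus (here `the`) receives an IDF of zero, so it drops out of the ranking without a stop list or a benchmark corpus, whereas the log-likelihood score always depends on which benchmark corpus supplies `b` and `d`.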
Pages: 12