An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus

Cited by: 4

Authors:

Chen, Liang-Ching [1]

Affiliations:

[1] ROC Mil Acad, Dept Foreign Languages, Kaohsiung 830, Taiwan

Source:

DATA & KNOWLEDGE ENGINEERING | 2024, Vol. 153

Keywords:

Keyword extraction; Natural Language Processing (NLP); Corpus linguistic; Dunning's Log-Likelihood Test (LLT); Extended Term Frequency-Inverse Document Frequency (TF-IDF) method; Climate change

DOI:

10.1016/j.datak.2024.102322

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning's Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.
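
For readers unfamiliar with the two statistics named in the abstract, the following is a minimal Python sketch of each. It is not the paper's extended method, which this record does not describe. It assumes the common two-term formulation of the log-likelihood keyness score and the textbook TF-IDF weighting (relative term frequency times the natural log of N over document frequency). The function names `log_likelihood` and `tf_idf` and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood (G2) keyness score (assumed formulation).

    a: frequency of the word in the target corpus
    b: frequency of the word in the benchmark (reference) corpus
    c: total tokens in the target corpus
    d: total tokens in the benchmark corpus
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # treat 0 * ln(0) as 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def tf_idf(docs):
    """Textbook TF-IDF for a list of tokenized documents.

    Returns one dict per document, mapping each term to
    (count / document length) * ln(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {t: (k / len(doc)) * math.log(n / df[t]) for t, k in counts.items()}
        )
    return scores


# Toy, made-up documents (not data from the paper).
docs = [
    ["climate", "change", "the", "emissions"],
    ["the", "climate", "model"],
    ["the", "policy", "carbon"],
]
scores = tf_idf(docs)
```

In this toy example, "the" occurs in every document, so its IDF is ln(1) = 0 and its TF-IDF score is zero without any stop list or benchmark corpus. A word with the same relative frequency in the target and benchmark corpora likewise gets a log-likelihood score of zero, which shows how that score depends on the choice of benchmark corpus.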


Pages: 12
