A new neutrosophic TF-IDF term weighting for text mining tasks: text classification use case

Cited by: 8
Authors
Bounabi, Mariem [1]
Elmoutaouakil, Karim [2]
Satori, Khalid [1]
Affiliations
[1] Univ Sidi Mohamed Ben Abdellah, Fac Sci Dhar El Mahraz Fes, Comp Sci Signals Automat & Cognitivism Lab LISAC, Fes, Morocco
[2] Sidi Mohamed Ben Abdallah Univ, Multidisciplinary Fac Taza, Engn Sci Lab, Taza, Morocco
Keywords
Web mining; Artificial intelligence; Machine learning; Fuzzy logic; Neutrosophic logic; Fuzzy TF-IDF; Neutrosophic TF-IDF; Text classification
DOI
10.1108/IJWIS-11-2020-0067
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: This paper presents a new term weighting approach for text classification as a text mining task. The proposed method, neutrosophic term frequency-inverse document frequency (NTF-IDF), extends the popular fuzzy TF-IDF (FTF-IDF) and uses neutrosophic reasoning to analyze and generate weights for terms in natural languages. The paper also provides a comparative study of FTF-IDF and NTF-IDF and their impact on different machine learning (ML) classifiers for document categorization.
Design/methodology/approach: After preprocessing the textual data, the proposed NTF-IDF applies a neutrosophic inference system (NIS) to produce weights for the terms representing a document. Using the local frequency TF, the global frequency IDF and the text length N as NIS inputs, this study generates two neutrosophic weights for a given term. The first measure gives the term's degree of relevance, and the second gives its degree of ambiguity. Next, the Zhang combination function is applied to combine the two neutrosophic weights into the final term weight, which is inserted into the document's representative vector. To analyze the impact of NTF-IDF on the classification phase, this study uses a set of ML algorithms.
Findings: By exploiting the characteristics of neutrosophic logic (NL), the authors were able to study the ambiguity of terms and their degree of relevance for representing a document. The choice of NL proved effective in defining significant text vectorization weights, especially for text classification tasks. The experiments demonstrate that the new method positively impacts categorization. Moreover, the adopted system's recognition rate is higher than 91%, an accuracy score not attained using FTF-IDF. Also, on benchmark data sets from different text mining fields and with many ML classifiers, i.e. SVM and feed-forward network, applying the proposed NTF-IDF term scores improves accuracy by 10%.
Originality/value: The novelty of this paper lies in two aspects. First, it introduces a new term weighting method that uses term frequencies as components to define the relevance and ambiguity of a term. Second, the application of NL to infer weights is original to this paper and aims to correct the shortcomings of FTF-IDF, which relies on fuzzy logic and inherits its drawbacks. The introduced technique was combined with different ML models to improve the accuracy and relevance of the feature vectors fed to the classification mechanism.
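The abstract names three crisp inputs to the inference system: the local frequency TF, the global frequency IDF and the text length. As a minimal sketch of how these could be computed for a tokenized corpus, the Python below uses a hypothetical helper `nis_inputs`. The normalizations are assumptions, since the abstract does not give the paper's exact formulas: TF is taken as count divided by document length, and IDF as the unsmoothed log(D / df). The neutrosophic inference step and the Zhang combination function are not reproduced here.

```python
import math
from collections import Counter


def nis_inputs(docs):
    """Return, per document, {term: (tf, idf, length)} for a tokenized corpus.

    Assumptions (not taken from the abstract): tf is the raw count divided
    by the document length, and idf is log(D / df) without smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        length = len(doc)
        counts = Counter(doc)
        result.append({
            term: (count / length, math.log(n_docs / df[term]), length)
            for term, count in counts.items()
        })
    return result


# Toy corpus, purely illustrative.
corpus = [
    ["fuzzy", "logic", "text"],
    ["neutrosophic", "logic", "text", "text"],
]
features = nis_inputs(corpus)
tf, idf, length = features[1]["neutrosophic"]
print(tf, idf, length)
```

In this toy corpus a term that occurs in every document, such as "logic", gets an IDF of zero, which is why a weighting scheme typically looks at more than one of these signals at a time.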
Pages: 229-249
Number of pages: 21