Text classification framework for short text based on TFIDF-FastText

Cited by: 0
Authors
Shrutika Chawla
Ravreet Kaur
Preeti Aggarwal
Affiliation
[1] Panjab University, CSE Department, University Institute of Engineering and Technology (UIET)
Source
Multimedia Tools and Applications | 2023 / Vol. 82
Keywords
Text classification; TFIDF; FastText; LGBM; Short text similarity; Paraphrasing
DOI
Not available
Abstract
Text classification is a high-priority problem in text mining and information retrieval, and it must capture the semantic information of the text. Although several approaches are used to detect similarity between short sentences, most of them miss the semantic information. This paper introduces a hybrid framework to classify semantically similar short texts from a given set of documents. A real-life dataset, Quora Question Pairs, is used for this purpose. In the proposed framework, the question pairs of short texts are pre-processed to eliminate junk information, and 25 token and string-equivalence features, which play a major role in classification, are engineered from the dataset. Redundant and overlapping features are removed, and word vectors are created using a TF-IDF weighted average FastText approach. Combining all the obtained features yields a 623-dimensional data model, which is then fed to the Light Gradient Boosting Machine for classification. Finally, the hyperparameters are tuned to optimize log_loss. The experimental results show that the proposed framework achieves 81.47% accuracy, which is on par with other state-of-the-art models.
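The abstract does not spell out how the TF-IDF weighted average of word vectors is computed. The sketch below illustrates one common reading of that step, not the authors' implementation: each word vector is scaled by its term frequency times a smoothed inverse document frequency, and the scaled vectors are averaged. The function names, the toy corpus, the 2-dimensional `embeddings` table standing in for FastText vectors, and the particular IDF smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (assumed form)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def weighted_sentence_vector(tokens, embeddings, idf, dim):
    """TF-IDF weighted average of word vectors for one tokenized sentence."""
    tf = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        vec = embeddings.get(tok)
        if vec is None:
            # Tokens without a vector are skipped in this toy version.
            continue
        w = count * idf.get(tok, 1.0)
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when no token has an embedding
    return [x / weight_sum for x in total]


# Hypothetical toy corpus and 2-d embeddings (stand-ins for FastText vectors).
docs = [
    ["how", "do", "i", "learn", "python"],
    ["how", "can", "i", "learn", "python", "fast"],
    ["what", "is", "fasttext"],
]
embeddings = {
    "how": [1.0, 0.0],
    "learn": [0.0, 1.0],
    "python": [1.0, 1.0],
    "fasttext": [0.5, 0.5],
}

idf = idf_weights(docs)
vec = weighted_sentence_vector(docs[0], embeddings, idf, dim=2)
```

In a pipeline like the one the abstract describes, one such vector per question would presumably be concatenated with the engineered features before classification; how the 623 dimensions are actually composed is not stated here.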
Pages: 40167-40180
Page count: 13