Text classification framework for short text based on TFIDF-FastText

Cited by: 6
Authors
Chawla, Shrutika [1]
Kaur, Ravreet [1]
Aggarwal, Preeti [1]
Affiliations
[1] Panjab Univ, Univ Inst Engn & Technol UIET, CSE Dept, Chandigarh, India
Keywords
Text classification; TFIDF; FastText; LGBM; Short text similarity; Paraphrasing
DOI
10.1007/s11042-023-15211-5
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification is a high-priority task in text mining and information retrieval, and it must address the problem of capturing the semantic information of text. Although several approaches are used to detect similarity between short sentences, most of them miss the semantic information. This paper introduces a hybrid framework to classify semantically similar short texts from a given set of documents. A real-life dataset, Quora Question Pairs, is used for this purpose. In the proposed framework, the question pairs of short texts are pre-processed to eliminate junk information, and 25 token and string-equivalence features, which play a major role in classification, are engineered from the dataset. Redundant and overlapping features are removed, and word vectors are created using a TF-IDF weighted average FastText approach. Combining all the obtained features yields a 623-dimensional data model, which is then fed to the Light Gradient Boosting Machine (LGBM) for classification. Finally, the hyperparameters are tuned to optimize log_loss. The experimental results show that the proposed framework achieves 81.47% accuracy, which is on par with other state-of-the-art models.
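The TF-IDF weighted averaging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy two-dimensional vectors in `toy_vectors` stand in for real FastText embeddings, the smoothed IDF formula is one common choice and is an assumption here, and the function names `idf_table` and `weighted_sentence_vector` are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def weighted_sentence_vector(tokens, idf, vectors, dim):
    """Average the word vectors of `tokens`, weighting each by tf * idf."""
    tf = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        # Skip tokens with no embedding or no IDF entry.
        if tok not in vectors or tok not in idf:
            continue
        w = count * idf[tok]
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be embedded
    return [x / weight_sum for x in total]


# Toy stand-ins for FastText word vectors (hypothetical values).
toy_vectors = {"how": [1.0, 0.0], "learn": [0.0, 1.0], "python": [1.0, 1.0]}
docs = [["how", "learn", "python"], ["how", "learn", "java"]]

idf = idf_table(docs)
vec = weighted_sentence_vector(docs[0], idf, toy_vectors, dim=2)
empty = weighted_sentence_vector(["unknown"], idf, toy_vectors, dim=2)
```

In this sketch, a token that appears in fewer documents ("python") receives a larger weight than one that appears in every document ("how"), so rarer words contribute more to the sentence vector. In a pipeline like the one described, such vectors for both questions of a pair would be concatenated with the engineered features before classification.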
Pages: 40167-40180
Number of pages: 14