A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Cited by: 13
Authors
Yang, Xi [1 ,2 ]
Yang, Kaiwen [1 ]
Cui, Tianxu [1 ]
Chen, Min [1 ]
He, Liyan [1 ]
Affiliations
[1] Beijing Wuzi Univ, Sch Informat, Beijing 101149, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
text vectorization; topic model; pretrained model; transfer learning; SELF-ATTENTION; LATENT; CLASSIFICATION; NEWS;
DOI
10.3390/pr10020350
CLC number
TQ [Chemical Industry];
Discipline classification code
0817;
Abstract
With the growth of Internet and cloud technology, the scale of data continues to expand, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data and to solve optimization problems in complex systems. Data is stored in many forms; among them, text is an important type that directly carries semantic information. Text vectorization is a core step in natural language processing tasks: because raw text cannot be used directly to train model parameters, it must first be converted into numerical vectors before feature extraction can be performed. The traditional approach builds a bag-of-words representation, but the resulting vectors cannot capture semantic relationships between words and are prone to data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information in the text. Then, transfer learning is carried out with the pretrained bidirectional encoder representations from transformers (BERT) model to generate vectors, which are used to compute the similarity between texts. In a comparative experiment, the proposed method is evaluated against traditional vectorization methods. The results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) achieve better results when computing the similarity between texts on the same topic, i.e., the method judges more accurately whether two given texts belong to the same topic.
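To make the described pipeline concrete, the following is a minimal sketch of one plausible reading of TTTV, not the authors' released code: latent Dirichlet allocation (LDA) extracts each document's keywords, a pretrained BERT encoder embeds those keywords, and cosine similarity compares the resulting vectors. The gensim and Hugging Face transformers libraries, the bert-base-uncased checkpoint, the toy corpus, and the dominant-topic keyword rule are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal illustrative sketch, NOT the authors' implementation: LDA keyword
# extraction followed by BERT embedding of the keywords, then cosine similarity.
# The libraries (gensim, transformers, torch), the bert-base-uncased checkpoint,
# the toy corpus, and the dominant-topic keyword rule are assumptions.
import torch
from gensim import corpora, models
from transformers import AutoTokenizer, AutoModel

docs = [
    "the central bank raised interest rates to curb inflation",
    "stock markets fell after the bank announced higher interest rates",
    "the striker scored twice and the team won the league match",
]
tokenized = [d.lower().split() for d in docs]

# Step 1: topic modeling (LDA) to extract each document's keywords.
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

def keywords(doc_idx, topn=5):
    # Take the top words of the document's dominant topic (illustrative rule).
    topic_id, _ = max(lda.get_document_topics(bow_corpus[doc_idx]),
                      key=lambda pair: pair[1])
    return [word for word, _ in lda.show_topic(topic_id, topn=topn)]

# Step 2: transfer learning via a pretrained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool the last-layer token embeddings into a single vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        output = bert(**inputs)
    return output.last_hidden_state.mean(dim=1).squeeze(0)

# Step 3: similarity between texts via their keyword embeddings.
vectors = [embed(" ".join(keywords(i))) for i in range(len(docs))]
cos = torch.nn.functional.cosine_similarity
print("same-topic pair (docs 0, 1):", cos(vectors[0], vectors[1], dim=0).item())
print("cross-topic pair (docs 0, 2):", cos(vectors[0], vectors[2], dim=0).item())
```

The paper's comparative baseline is a traditional bag-of-words vectorization; the same cosine comparison could be run on bag-of-words count vectors to reproduce that contrast against the sketch above.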
Pages: 16