A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Cited by: 13
Authors
Yang, Xi [1, 2]
Yang, Kaiwen [1]
Cui, Tianxu [1]
Chen, Min [1]
He, Liyan [1]
Affiliations
[1] Beijing Wuzi University, School of Information, Beijing 101149, People's Republic of China
[2] University of Science and Technology Beijing, School of Computer and Communication Engineering, Beijing 100083, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
text vectorization; topic model; pretrained model; transfer learning; self-attention; latent; classification; news
DOI
10.3390/pr10020350
Chinese Library Classification (CLC) number
TQ [Chemical Industry]
Discipline classification code
0817
Abstract
As Internet cloud technology develops, the scale of data keeps expanding, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data and to solve optimization problems in complex systems. Data is stored in many forms; among them, text is an important type that directly reflects semantic information. Text vectorization is an important concept in natural language processing. Because raw text cannot be used directly to train model parameters, it must first be vectorized, that is, converted into numerical form, before feature extraction can be carried out. Traditional text vectorization is often implemented by constructing a bag of words, but the resulting vectors cannot reflect the semantic relationships between words and are prone to data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information of the text. Then the pretrained bidirectional encoder representations from transformers (BERT) model is used for model transfer learning to generate vectors, which are applied to calculating the similarity between texts. In a comparative experiment, the method is compared with traditional vectorization methods. The experimental results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) give better results when calculating the similarity between texts on the same topic, which means that the method can judge more accurately whether two given texts belong to the same topic.
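The bag-of-words limitation and the keyword-then-vector pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tokenizer, the example sentences, and the function names (`bag_of_words`, `keyword_vector_similarity`) are assumptions made for illustration, and `extract_keywords` and `embed` are hypothetical callables standing in for a topic model and a pretrained encoder such as BERT, neither of which is implemented here.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from typing import Callable, Sequence


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs: Sequence[str]) -> tuple[list[str], list[list[int]]]:
    """Build a shared vocabulary and one word-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_vector_similarity(
    text_a: str,
    text_b: str,
    extract_keywords: Callable[[str], Sequence[str]],
    embed: Callable[[Sequence[str]], Sequence[float]],
) -> float:
    """Shape of the pipeline in the abstract: keywords -> dense vector -> similarity.

    Both callables are hypothetical placeholders: `extract_keywords` stands in
    for a topic model, `embed` for a pretrained encoder.
    """
    vec_a = embed(extract_keywords(text_a))
    vec_b = embed(extract_keywords(text_b))
    return cosine(vec_a, vec_b)


# Two sentences with the same meaning but no shared words.
docs = ["The car is fast", "An automobile moves quickly"]
vocab, vectors = bag_of_words(docs)
bow_similarity = cosine(vectors[0], vectors[1])

print(len(vocab))       # 8 dimensions for only two short sentences
print(bow_similarity)   # 0.0: the count vectors share no non-zero dimension
```

The vocabulary grows by one dimension for every new word, and the two paraphrases score zero similarity because their count vectors never overlap. Dense vectors from a pretrained encoder are intended to avoid both effects, which is the motivation the abstract gives for the proposed method.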
Pages: 16