A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Cited by: 13
Authors
Yang, Xi [1, 2]
Yang, Kaiwen [1 ]
Cui, Tianxu [1 ]
Chen, Min [1 ]
He, Liyan [1 ]
Affiliations
[1] Beijing Wuzi Univ, Sch Informat, Beijing 101149, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
text vectorization; topic model; pretrained model; transfer learning; SELF-ATTENTION; LATENT; CLASSIFICATION; NEWS;
DOI
10.3390/pr10020350
Chinese Library Classification
TQ [Chemical Industry];
Discipline classification code
0817;
Abstract
With the development of Internet cloud technology, the scale of data keeps expanding, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data in order to solve optimization problems in complex systems. Data is stored in many forms; among them, text is an important data type that directly reflects semantic information. Text vectorization is an important concept in natural language processing tasks. Because text data cannot be used directly for model parameter training, the original text must first be vectorized, that is, converted to numerical form, before feature extraction can be carried out. The traditional approach to text vectorization often constructs a bag of words, but the vectors it generates cannot reflect the semantic relationships between words, and it easily leads to data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information of the text. Then, transfer learning is carried out with the pretrained bidirectional encoder representations from transformers (BERT) model to generate vectors, which are applied to the calculation of similarity between texts. A comparative experiment compares this method with the traditional vectorization method. The experimental results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) obtain better results when calculating the similarity between texts with the same topic, which means that the method can more accurately judge whether two given texts belong to the same topic.
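The bag-of-words limitation described in the abstract can be illustrated with a minimal sketch. This is not the TTTV method from the paper: the toy texts and the helper names `bag_of_words` and `cosine_similarity` are assumptions made for illustration only. The sketch shows that two texts on the same topic but with no shared words receive a cosine similarity of zero, and that the count vectors are mostly zeros even for a tiny vocabulary.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: texts 0 and 1 share a topic but no words;
# texts 0 and 2 share words.
texts = [
    "the car needs fuel",
    "an automobile requires gasoline",
    "the car needs a wash",
]

# The vocabulary grows with every new distinct word (dimension explosion).
vocabulary = sorted({word for text in texts for word in text.lower().split()})
vectors = [bag_of_words(text, vocabulary) for text in texts]

# Same topic, no lexical overlap: bag of words sees no similarity at all.
same_topic_no_overlap = cosine_similarity(vectors[0], vectors[1])

# Shared surface words: similarity is driven purely by word overlap.
shared_words = cosine_similarity(vectors[0], vectors[2])

# Fraction of zero entries across all vectors (data sparsity).
zero_fraction = sum(x == 0 for vec in vectors for x in vec) / (
    len(vectors) * len(vocabulary)
)

print(same_topic_no_overlap, shared_words, zero_fraction)
```

A dense vector from a pretrained encoder such as BERT could be passed to the same `cosine_similarity` function in place of the count vectors; that substitution is what the similarity comparison in the abstract relies on, though the exact encoder configuration is not given here.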
Pages: 16
Related papers
50 in total
  • [21] Text Error Correction Method in the Construction Industry Based on Transfer Learning
    Hou, Zhenguo
    Yang, Weitao
    He, Haiying
    Zhang, Peicong
    Wang, Ziyu
    Ji, Xiaosheng
    COMMUNICATIONS AND NETWORKING (CHINACOM 2021), 2022, : 277 - 290
  • [22] Text Classification Method Based On Semi-Supervised Transfer Learning
    Yu, Xiaosheng
    Zhang, Hehuan
    Li, Jing
    2021 21ST INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C 2021), 2021, : 388 - 394
  • [23] Topic Representation: A Novel Method of Tag Recommendation for Text
    Zhong, Shangru
    Lei, Kai
    Huang, Xiaohui
    Wu, Jincheng
    2017 IEEE 2ND INTERNATIONAL CONFERENCE ON BIG DATA ANALYSIS (ICBDA), 2017, : 671 - 676
  • [24] Model and Data Integrated Transfer Learning for Unstructured Map Text Detection
    Zhai, Yanrui
    Zhou, Xiran
    Li, Honghao
    ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2023, 12 (03)
  • [25] Causality Model for Text Data with a Hierarchical Topic Structure
    Ogawa, Takuro
    Shimadzu, Hideyasu
    Saga, Ryosuke
    2020 25TH INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE (TAAI 2020), 2020, : 205 - 210
  • [26] Semantic Augmented Topic Model over Short Text
    Li, Lingyun
    Sun, Yawei
    Wang, Cong
    PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 652 - 656
  • [27] Classification of Text Documents Based on a Probabilistic Topic Model
    Karpovich, S. N.
    Smirnov, A. V.
    Teslya, N. N.
    SCIENTIFIC AND TECHNICAL INFORMATION PROCESSING, 2019, 46 (05) : 314 - 320
  • [28] Classification of Text Documents Based on a Probabilistic Topic Model
    S. N. Karpovich
    A. V. Smirnov
    N. N. Teslya
    Scientific and Technical Information Processing, 2019, 46 : 314 - 320
  • [29] Short text optimized topic model for service clustering
    Lu J.-W.
    Zheng J.-H.
    Li D.-N.
    Xu J.
    Xiao G.
    Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science), 2022, 56 (12) : 2416 - 2425+2444
  • [30] Multilayer vectorization to develop a deeper image feature learning model
    Hemanand, D.
    Bhavani, N. P. G.
    Ayub, Shahanaz
    Ahmad, Mohd Wazih
    Narayanan, S.
    Haldorai, Anandakumar
    AUTOMATIKA, 2023, 64 (02) : 355 - 364