Transductive transfer learning based Genetic Programming for balanced and unbalanced document classification using different types of features

Cited by: 4
Authors
Fu, Wenlong [1]
Xue, Bing [1]
Gao, Xiaoying [1]
Zhang, Mengjie [1]
Affiliations
[1] Sch Engn & Comp Sci, POB 600, Wellington 6140, New Zealand
Keywords
Genetic Programming; Document classification; Transfer learning; TEXT CLASSIFICATION; REPRESENTATIONS; WORDS; IDF;
DOI
10.1016/j.asoc.2021.107172
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Document classification is one of the predominant tasks in Natural Language Processing. However, some document classification datasets have no ground truth, while other similar datasets do. Transfer learning can use similar datasets with ground truth to train effective classifiers on a dataset without ground truth. This paper introduces a transductive transfer learning method for document classification using two different text feature representations: the term frequency (TF) and the semantic feature doc2vec. It has three main contributions. First, it enables knowledge sharing between a dataset using TF and a dataset using doc2vec in transductive transfer learning to improve performance. Second, it demonstrates that the partially learned programs from TF and from doc2vec can be used alternately to "label then learn", and that they improve each other. Lastly, it addresses the unbalanced dataset problem by taking the unbalanced category distributions into account when evolving Genetic Programming (GP) programs on the target domains. Experimental results on two popular document datasets show that the proposed technique effectively transfers knowledge from the GP programs evolved on the source domains to new GP programs on the target domains using TF or doc2vec. The GP programs evolved by the proposed method achieve an improvement of more than 10 percent over the GP programs evolved directly on the source domains. The proposed technique also effectively uses GP programs evolved from unbalanced datasets (on the source and target domains) to evolve new GP programs on the target domains that balance predictions across categories. © 2021 Elsevier B.V. All rights reserved.
Pages: 11