A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification

Cited by: 4
Authors
Lefebvre, Guillaume [1,2]
Elghazel, Haytham [1]
Guillet, Theodore [1]
Aussem, Alexandre [1]
Sonnati, Matthieu [2]
Affiliations
[1] Univ Lyon 1, CNRS, UMR 5205, LIRIS, F-69622 Lyon, France
[2] Inokufu, Lyon, France
Keywords
NLP; Transformers; Sentence similarity; Sentence embedding; Education and professional training domain; Information retrieval; Classification; Hierarchical Multi-label Classification;
DOI
10.1016/j.datak.2024.102281
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, Natural Language Processing (NLP) has made significant advances through general-purpose language embeddings, enabling breakthroughs in tasks such as semantic similarity and text classification. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex settings, and when applied to domain-specific texts such as those of the education and professional training domain, general language embedding models often inadequately represent the specialized terminology and contextual nuances of the domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This framework chains multiple classifiers, each built on BERTEPro, a novel sentence-embedding method based on existing Transformer models whose pre-training has been extended on education and professional training texts before fine-tuning on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates its local probability predictions, augmented with the input feature vectors, to the classifier in charge of the subsequent level. HMCCCProbT addresses issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments on three domain-specific textual HMC datasets show that HMCCCProbT compares favorably with state-of-the-art HMC algorithms in terms of classification accuracy, and that BERTEPro produces better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
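The per-level classifier chaining described in the abstract can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the scikit-learn estimators, the function names fit_chained_levels and predict_chained_levels, and the use of plain concatenation are assumptions; the only elements taken from the abstract are that each hierarchical level has its own classifier and that its local probability predictions, joined to the original sentence embeddings, feed the classifier of the next level.

```python
# Minimal sketch (not the authors' code) of the local-classifier-per-level
# chaining described in the abstract. The estimator choice (one-vs-rest
# logistic regression) and all function names are illustrative assumptions;
# X would hold sentence embeddings such as those produced by BERTEPro.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


def fit_chained_levels(X, level_labels):
    """X: (n_samples, dim) sentence embeddings.
    level_labels: one binary indicator matrix (n_samples, n_classes_at_level)
    per hierarchical level, ordered from the root level to the deepest one."""
    classifiers, features = [], X
    for Y in level_labels:
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        clf.fit(features, Y)
        # The local probability predictions of this level, appended to the
        # original embeddings, become the input of the next level's classifier.
        probs = clf.predict_proba(features)
        features = np.hstack([X, probs])
        classifiers.append(clf)
    return classifiers


def predict_chained_levels(classifiers, X):
    predictions, features = [], X
    for clf in classifiers:
        probs = clf.predict_proba(features)
        predictions.append(probs)          # one probability matrix per level
        features = np.hstack([X, probs])
    return predictions
```

Propagating probabilities rather than hard labels lets each downstream classifier weigh uncertain parent-level decisions instead of committing to them prematurely.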
Pages: 15