Comparing Vectorization Techniques, Supervised and Unsupervised Classification Methods for Scientific Publication Categorization in the UNESCO Taxonomy

Times cited: 1
Authors
Villamizar, Neil [1]
Wahrman, Jesus [1]
Villasana, Minaya [1]
Affiliations
[1] Univ Simon Bolivar, Caracas, Venezuela
Source
ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, AIAI 2023, PT I | 2023 / Vol. 675
Keywords
Classification; Natural language processing; Scientific texts; Machine learning
DOI
10.1007/978-3-031-34111-3_30
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A comparison of classification strategies for scientific articles using the UNESCO taxonomy for categorization is presented. An annotated set of articles was vectorized using TF-IDF, Doc2Vec, BERT, and SPECTER, and it was established that, among those options, SPECTER provided the best separability properties according to both quantitative metrics and qualitative inspection of 2D t-SNE projections. When the best-performing vectorization strategy is paired with classical machine learning classifiers, such as the multilayer perceptron and support vector machines, comparable results are found, leading to the conclusion that the choice of text representation has a greater impact than the choice of classifier. The most problematic areas for classification were identified, and a cascading classification strategy was implemented and evaluated. Unsupervised methods were also tested to assess their suitability when annotated data is not readily available. Two different unsupervised methods were used, and it was determined that k-means yielded the best results when the number of clusters was set to 3 times the number of categories.
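The unsupervised setting described in the abstract (k-means with three times as many clusters as categories) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy 2D points stand in for document embeddings, the labels `category_a` and `category_b` are placeholders, the evenly spaced centroid initialization is a simplification, and the majority-vote mapping from clusters to categories is one plausible way to evaluate clusters against annotated data.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means; returns one cluster index per point.

    Centroids start at evenly spaced points of the input list
    (an illustrative simplification, not the paper's setup).
    """
    n = len(points)
    centroids = [list(points[i * n // k]) for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def map_clusters_to_labels(assignment, labels):
    """Assign each cluster the most common annotated label among its members."""
    votes = {}
    for cluster, label in zip(assignment, labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Toy stand-ins for document embeddings: two well-separated groups.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.4, 0.2), (0.3, 0.5), (0.5, 0.4),
    (10.0, 10.1), (10.2, 10.0), (10.1, 10.3), (10.4, 10.2), (10.3, 10.5), (10.5, 10.4),
]
labels = ["category_a"] * 6 + ["category_b"] * 6  # placeholder category names

n_categories = len(set(labels))
n_clusters = 3 * n_categories  # the 3x heuristic reported in the abstract

assignment = kmeans(points, n_clusters)
cluster_to_label = map_clusters_to_labels(assignment, labels)
predicted = [cluster_to_label[c] for c in assignment]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(n_clusters, accuracy)
```

Using more clusters than categories lets a single category be covered by several clusters, which may help when a category is not one compact group in the embedding space. In practice a library implementation with a more robust initialization would normally replace the hand-written loop.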
Pages: 356-368
Number of pages: 13