Text-based paper-level classification procedure for non-traditional sciences using a machine learning approach

Cited by: 0
Authors
Moctezuma, Daniela [1]
Lopez-Vazquez, Carlos [2]
Lopes, Lucas [3]
Trevisan, Norton [3]
Perez, Jose [3]
Affiliations
[1] Ctr Invest Ciencias Informac Geoespacial, Circuito Tecnopolo Norte,107 Col Tecnopolo Pocitos, Aguascalientes 20313, Mexico
[2] Univ ORT, LatinGEO Lab IGM ORT, Cuareim 1451, Montevideo 11100, Uruguay
[3] Univ Sao Paulo, Sch Arts Sci & Humanities, Sao Paulo, Brazil
Keywords
Gold standard; NLP; Machine learning; Agreement's annotator;
DOI
10.1007/s10115-023-02023-0
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Science as a whole is organized into broad fields, and as a consequence, research, resources, students, etc., are also classified, assigned, or invited following a similar structure. Some fields have been established for centuries, while others are just emerging. Funding, staff, etc., are offered to support a field if there is activity in it, commonly measured by the number of published scientific papers. How can such papers be found? There exist well-respected listings in which scientific journals are ascribed to one or more knowledge fields. Such lists are human-made, and complications arise when a field covers more than one area of knowledge. How can one discern whether a particular paper is devoted to a field not considered in such lists? In this work, we propose a methodology able to classify the universe of papers into two classes: those belonging to the field of interest, and those that do not. The proposed procedure learns from the titles and abstracts of papers published in monothematic or "pure" journals. Provided that such journals exist, the procedure could be applied to any field of knowledge. We tested the process with Geographic Information Science. The field overlaps with Computer Science, Mathematics, Cartography, and others, which makes the task very difficult. We also analyzed the results of our procedure under three different criteria, illustrating its power and capabilities. Our proposed solution reached results similar to those of human taggers and comparable to those of state-of-the-art related work.
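The two-class setup described in the abstract can be illustrated with a minimal sketch. The abstract does not name the models the authors used, so the multinomial naive Bayes classifier below is a swapped-in stand-in, not the paper's method. The class name `NaiveBayesTextClassifier`, the labels `in_field` and `out_of_field`, and the four training texts are all hypothetical and invented for illustration; each training text stands for a concatenated title and abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with Laplace smoothing.

    Illustrative stand-in only: the abstract does not say which
    learning algorithm the authors actually used.
    """

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # skip words never seen during training
                # Add-one (Laplace) smoothed log likelihood of the token.
                score += math.log(
                    (self.word_counts[c][tok] + 1)
                    / (total_words + len(self.vocab))
                )
            scores[c] = score
        return max(scores, key=scores.get)


# Hypothetical toy data: "in_field" texts play the role of papers from
# "pure" journals of the field, "out_of_field" texts come from elsewhere.
train_texts = [
    "Spatial interpolation of map data. We study geographic coordinates and cartographic projections.",
    "Geographic information systems for spatial analysis. Maps and geospatial data quality.",
    "Protein folding dynamics. We study enzyme structure in cell biology.",
    "Clinical trial of a new drug. Patients and treatment outcomes in medicine.",
]
train_labels = ["in_field", "in_field", "out_of_field", "out_of_field"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("Geospatial map accuracy. Spatial data and cartographic errors."))
```

A real application would train on thousands of titles and abstracts and evaluate against human-tagged papers, as the abstract describes; the toy data here only shows the shape of the inputs and outputs.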
Pages: 1503-1520
Number of pages: 18