Identifying potentially excellent publications using a citation-based machine learning approach

Cited by: 18
Authors
Hu, Zewen [1]
Cui, Jingjing [1]
Lin, Angela [2]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Management Sci & Engn, Nanjing 210044, Peoples R China
[2] Univ Sheffield, Informat Sch, Sheffield S10 2TN, England
Keywords
Machine learning; Artificial intelligence; Excellent papers; Highly cited papers; Sleeping beauty; Citation-based measures; Citation peak; Neural network; LightGBM; TabNet; HIGHLY CITED PAPERS; SLEEPING BEAUTIES; IMPACT; COUNTS; PREDICTION; PATTERNS; FEATURES; PROBE
DOI
10.1016/j.ipm.2023.103323
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Excellent research papers are vital to advances in science and technology. Thus, identifying potentially excellent papers (PEPs) early and recognizing their value is high on the research agenda. This study used a set of 5 static and 8 time-dependent citation features to compare six machine learning methods and determine which performs best at identifying PEPs. The study modelled Random Forest, LightGBM, Naive Bayes, Support Vector Machine, Neural Network, and TabNet to identify PEPs in the artificial intelligence (AI) field. Highly cited papers were defined using top 1% and top 5% thresholds, and the data were collected from the Web of Science. Bibliometric and citation data from 485,041 research articles, proceedings papers, and reviews published in AI between 1990 and 2010 were collected initially. After screening and processing, the final dataset consisted of 96,169 papers for the training and test sets. The findings suggest that the time-dependent citation features are more important than the static features, and that citation peak features are more significant than the citation features in identifying PEPs. The findings also demonstrate the effect of the threshold (e.g., top 1% versus top 5%) on machine learning outcomes; the study therefore argues that the threshold should be selected carefully. LightGBM and Random Forest both performed well under the given conditions and achieved the same accuracy and recall scores. However, on other indicators, such as F1 and cross-entropy loss, LightGBM performed better. The study concluded that LightGBM was the best-performing model for identifying PEPs. The paper identifies its contributions and recommends future research.
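The abstract refers to labelling highly cited papers with a top 1% or top 5% threshold and to comparing models on F1 and cross-entropy loss. The following is a minimal sketch of those three ideas in plain Python, not the authors' implementation. The function names, the tie-handling rule, and the toy citation counts are assumptions made for illustration only.

```python
import math


def label_top_percent(citations, percent):
    """Label a paper 1 if its citation count is in the top `percent`%, else 0.

    Assumption: papers tied with the cut-off count are all labelled 1,
    so the positive class can be slightly larger than `percent`%.
    """
    n_top = max(1, math.ceil(len(citations) * percent / 100))
    # Cut-off is the citation count of the n_top-th most cited paper.
    threshold = sorted(citations, reverse=True)[n_top - 1]
    return [1 if c >= threshold else 0 for c in citations]


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_entropy(y_true, probs, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


if __name__ == "__main__":
    # Hypothetical toy data: 100 papers with citation counts 1..100.
    toy_citations = list(range(1, 101))
    for pct in (1, 5):
        labels = label_top_percent(toy_citations, pct)
        print(f"top {pct}%: {sum(labels)} positive of {len(labels)} papers")
```

Because the positive class shrinks sharply when moving from the top 5% to the top 1%, accuracy alone can look similar across models, which is one reason indicators such as F1 and cross-entropy loss are useful for telling them apart.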
Pages: 22