Overview of the PAN@FIRE 2020 Task on the Authorship Identification of SOurce COde

Cited by: 0
Authors
Fadel, Ali [1 ]
Musleh, Husam [1 ]
Tuffaha, Ibraheem [1 ]
Al-Ayyoub, Mahmoud [1 ]
Jararweh, Yaser [2 ]
Benkhelifa, Elhadj [3 ]
Affiliations
[1] Jordan Univ Sci & Technol, Irbid, Jordan
[2] Duquesne Univ, Pittsburgh, PA 15219 USA
[3] Staffordshire Univ, Stoke On Trent, Staffs, England
Source
PROCEEDINGS OF THE 12TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2020) | 2020
Keywords
authorship-identification; source-code; datasets
DOI
10.1145/3441501.3441532
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Authorship identification is essential for detecting deceptive misuse of others' content and for exposing the owners of anonymous malicious content. While it is widely studied for natural languages, it is rarely considered for programming languages. Accordingly, a PAN@FIRE task, named Authorship Identification of SOurce COde (AI-SOCO), is proposed with a focus on identifying the authors of source code. The dataset consists of crawled source codes submitted to the CodeForces online judge platform by the top 1,000 human users with 100 or more correct C++ submissions. The participating systems are asked to predict the author of a given source code from the predefined list of code authors. In total, 60 teams registered on the task's CodaLab page. Of these, 14 teams submitted 94 runs. The results are surprisingly high, with many teams and baselines breaking the 90% accuracy barrier. These systems used a wide range of models and techniques, from pretrained word embeddings (especially those tweaked to handle source code) to stylometric features.
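The task setup described in the abstract (pick one author from a predefined list, scored by accuracy) can be illustrated with a small sketch. This is not the task's official baseline or any participant's system: it assumes character trigram counts as a crude stylometric signal and a cosine-similarity match against one summed profile per author. All function names and the toy submissions are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(code, n=3):
    """Count overlapping character n-grams in a source string."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one summed n-gram profile per author from (author, code) pairs."""
    profiles = {}
    for author, code in samples:
        profiles.setdefault(author, Counter()).update(char_ngrams(code))
    return profiles


def predict(profiles, code):
    """Return the author whose profile is most similar to the given code."""
    grams = char_ngrams(code)
    return max(profiles, key=lambda author: cosine(grams, profiles[author]))


def accuracy(profiles, labelled):
    """Fraction of (author, code) pairs whose author is predicted correctly."""
    labelled = list(labelled)
    correct = sum(predict(profiles, code) == author for author, code in labelled)
    return correct / len(labelled)


# Toy data (hypothetical): two authors with visibly different habits.
TRAIN = [
    ("author_a", "int main(){int n;cin>>n;cout<<n*2;return 0;}"),
    ("author_a", "int main(){int x;cin>>x;cout<<x+1;return 0;}"),
    ("author_b", 'int main()\n{\n    int n;\n    scanf("%d", &n);\n    printf("%d", n * 2);\n}\n'),
    ("author_b", 'int main()\n{\n    long k;\n    scanf("%ld", &k);\n    printf("%ld", k + 1);\n}\n'),
]
TEST = [
    ("author_a", "int main(){int y;cin>>y;cout<<y-1;return 0;}"),
    ("author_b", 'int main()\n{\n    int m;\n    scanf("%d", &m);\n    printf("%d", m - 1);\n}\n'),
]

profiles = train(TRAIN)
print(accuracy(profiles, TEST))
```

With 1,000 candidate authors, a real system would need far stronger features and models than this; the sketch only shows the shape of the prediction and scoring steps.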
Pages: 4-8
Number of pages: 5