Classification of human- and AI-generated texts for different languages and domains

Cited by: 0
Authors
Kristina Schaaff [1]
Tim Schlippe [1]
Lorenz Mindner [1]
Affiliations
[1] IU International University of Applied Sciences,
Keywords
Generative AI; ChatGPT; Natural language processing; Features; Prompting; Artificial intelligence; Text classification
DOI
10.1007/s10772-024-10143-3
Abstract
Chatbots based on large language models (LLMs) like ChatGPT are available to the general public. Students can use these tools, for instance, to generate essays or entire theses from scratch or by rephrasing an existing text. But how does a teacher, for instance, know whether a text was written by a student or an AI? In this paper, we investigate perplexity, semantic, list lookup, document, error-based, readability, AI feedback, and text vector features to classify human-generated and AI-generated texts from the educational domain as well as news articles. We analyze two scenarios: (1) the detection of text generated by AI from scratch, and (2) the detection of text rephrased by AI. Since we assumed that classification is more difficult when the AI has been prompted to create or rephrase the text so that a human would not recognize it as AI-generated or AI-rephrased, we also investigate this advanced prompting scenario. To train, fine-tune, and test the classifiers, we created the Multilingual Human-AI-Generated Text Corpus, which contains human-generated, AI-generated, and AI-rephrased texts from the educational domain in English, French, German, and Spanish, as well as English texts from the news domain. We demonstrate that the same features can be used to detect AI-generated and AI-rephrased texts from the educational domain in all languages and to detect AI-generated and AI-rephrased news texts. Our best systems significantly outperform GPTZero and ZeroGPT, state-of-the-art systems for the detection of AI-generated text. Our best text rephrasing detection system even outperforms GPTZero by a relative 181.3% in F1-score.
Pages: 935-956
Number of pages: 21