MACHINE LEARNING APPROACHES TO IDENTIFY AI-GENERATED TEXT: A COMPARATIVE ANALYSIS

Cited by: 0

Authors:

Mihir, T. Krishna [1]

Harsha, K. Venkata Sai [1]

Nitya, S. Yuva [1]

Krishna, G. Bala [1]

Anamalamudi, Satish [1]

Sivarajan, Sruthi [1]

Affiliations:

[1] SRM Univ AP, Dept CSE, Amaravati, India

Source:

2024 INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND EMERGING COMMUNICATION TECHNOLOGIES, ICEC | 2024

Keywords:

BERT; Fake content detection; Contextual Encoding; Paraphrasing Attacks; Academic Dishonesty;

DOI:

10.1109/ICEC59683.2024.10837481

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large language models (LLMs), such as OpenAI's GPT-4 and Google's Pathways Language Model, have become indispensable to daily life and work, and are frequently used without our conscious knowledge. Although AI writing still contains subtle defects, studies indicate that even professional crowdsourcing workers have trouble distinguishing human-written content from AI-generated content, so detection remains difficult. These models offer many benefits and could fundamentally change the way people work and learn, but they have also drawn considerable attention for their potential drawbacks. One notable use of LLMs that demonstrates their potential for task automation is the generation of academic reports or articles with little to no human input. As a result, researchers have concentrated on building detectors to address possible misconduct involving LLM-generated content. However, current methods frequently neglect generalizability in favor of accuracy on small datasets. Our study offers a thorough examination of machine learning (ML) techniques intended to differentiate text produced by artificial intelligence (AI) from text written by humans. We use a Kaggle dataset in order to gain greater insight into the data collection procedure. To address the challenge of accurately and reliably detecting AI-generated text, we explore various ML approaches. We examine a broad spectrum of algorithms, taking into account performance measures and their generalizability to different datasets and scenarios. By explaining the significance of LLMs, the challenges they pose, and the specific focus of our study, we lay the foundation for a thorough analysis of ML algorithms for text identification.


Pages: 411-416

Page count: 6
