Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information Systems Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023 / Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted to a wide range of natural language generation tasks. However, language models pretrained on large unlabeled web-text corpora have been shown to degenerate into toxic content and socially biased behavior, hindering their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify models conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. To address the challenge of safety in language models, we propose a new reward model that detects toxic content and mitigates unintended bias toward social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, and that our approach is less prone to unintended bias toward social identities in the generated content.
Pages: 8421-8435
Number of pages: 14
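The abstract describes fine-tuning a language model with reinforcement learning against a reward model that scores toxicity. As a rough, hypothetical sketch only (not the authors' published implementation), the following self-contained PyTorch example shows the general recipe: sample continuations from the policy, score them with a stub toxicity reward, penalize divergence from a frozen reference model, and apply a REINFORCE-style update. TinyLM, toxicity_reward, and KL_COEF are illustrative assumptions; the paper's actual reward model is a learned classifier trained to reduce unintended identity bias.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, MAXLEN, BATCH = 100, 32, 64, 12, 8

class TinyLM(nn.Module):
    # Minimal autoregressive LM standing in for a pretrained transformer.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def forward(self, ids):                      # ids: (batch, time)
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)                      # next-token logits

def toxicity_reward(sample_ids):
    # Stub reward model (assumption): a fixed "toxic" token set is penalized.
    # In the paper this role is played by a learned, identity-debiased classifier.
    toxic = torch.tensor([3, 7, 11])
    frac_toxic = (sample_ids.unsqueeze(-1) == toxic).any(-1).float().mean(1)
    return 1.0 - frac_toxic                      # higher reward = less toxic

policy = TinyLM()
reference = TinyLM()                             # frozen stand-in for the pretrained LM
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
KL_COEF = 0.1                                    # strength of the fluency anchor

for step in range(200):
    # Sample continuations from the current policy (prompt = token 0).
    ids = torch.zeros(BATCH, 1, dtype=torch.long)
    logps = []
    for _ in range(MAXLEN):
        dist = torch.distributions.Categorical(logits=policy(ids)[:, -1])
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        ids = torch.cat([ids, tok.unsqueeze(1)], dim=1)
    logps = torch.stack(logps, dim=1)            # (batch, MAXLEN)

    # Reward = toxicity score minus a KL-style penalty toward the reference.
    with torch.no_grad():
        ref_logps = F.log_softmax(reference(ids[:, :-1]), -1)
        ref_logps = ref_logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    kl = (logps.detach() - ref_logps).sum(1)
    reward = toxicity_reward(ids[:, 1:]) - KL_COEF * kl

    # REINFORCE with a mean-reward baseline.
    advantage = reward - reward.mean()
    loss = -(advantage.unsqueeze(1) * logps).sum(1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

The penalty toward the frozen reference model is a common device in RL-based detoxification for preserving fluency while optimizing the reward; the policy-gradient flavor here is suggested by the method's name, Reinforce-Detoxify, not confirmed by the record itself.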
Related papers
50 records in total
  • [1] Reward modeling for mitigating toxicity in transformer-based language models
    Faal, Farshid
    Schmitt, Ketra
    Yu, Jia Yuan
    APPLIED INTELLIGENCE, 2023, 53 (07) : 8421 - 8435
  • [2] Transformer-based language models for mental health issues: A survey
    Greco, Candida M.
    Simeri, Andrea
    Tagarelli, Andrea
    Zumpano, Ester
    PATTERN RECOGNITION LETTERS, 2023, 167 : 204 - 211
  • [3] Quantifying the Bias of Transformer-Based Language Models for African American English in Masked Language Modeling
    Salutari, Flavia
    Ramos, Jerome
    Rahmani, Hossein A.
    Linguaglossa, Leonardo
    Lipani, Aldo
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2023, PT I, 2023, 13935 : 532 - 543
  • [4] AMMU: A survey of transformer-based biomedical pretrained language models
    Kalyan, Katikapalli Subramanyam
    Rajasekharan, Ajit
    Sangeetha, Sivanesan
    JOURNAL OF BIOMEDICAL INFORMATICS, 2022, 126
  • [5] Pre-trained transformer-based language models for Sundanese
    Wongso, Wilson
    Lucky, Henry
    Suhartono, Derwin
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [6] Automatic text summarization using transformer-based language models
    Rao, Ritika
    Sharma, Sourabh
    Malik, Nitin
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2024, 15 (06) : 2599 - 2605
  • [7] Transformer-Based Composite Language Models for Text Evaluation and Classification
    Skoric, Mihailo
    Utvic, Milos
    Stankovic, Ranka
    MATHEMATICS, 2023, 11 (22)
  • [8] Pre-training and Evaluating Transformer-based Language Models for Icelandic
    Daðason, Jon Friðrik
    Loftsson, Hrafn
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 7386 - 7391
  • [9] Enhancing Address Data Integrity using Transformer-Based Language Models
    Kurklu, Omer Faruk
    Akagunduz, Erdem
    32ND IEEE SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU 2024, 2024