Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information Systems Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to exhibit social biases, hindering their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when they are conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias toward social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics and is less prone to unintended bias toward social identities in generated content.
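The abstract describes the core loop of reward-model-guided detoxification: a reward model scores sampled continuations for toxicity, and the language model is fine-tuned with policy-gradient reinforcement learning to maximize that reward. The following is a minimal sketch of that loop, assuming a GPT-2 policy and the public unitary/toxic-bert classifier standing in for the paper's own bias-aware reward model; the model names, the single REINFORCE step, the example prompt, and all hyperparameters are illustrative assumptions, not the authors' exact setup.

import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Policy to detoxify: an ordinary causal LM (GPT-2 as a stand-in).
policy_tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

# Stand-in reward model: a public toxicity classifier, assuming its
# first logit scores the "toxic" label (the paper trains its own
# bias-aware reward model instead).
reward_tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "unitary/toxic-bert").to(device).eval()

def reward(texts):
    # Higher reward for less toxic text: negate the toxicity probability.
    with torch.no_grad():
        batch = reward_tok(texts, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        return -torch.sigmoid(reward_model(**batch).logits[:, 0])

prompt = "The people who live there are"
enc = policy_tok(prompt, return_tensors="pt").to(device)
prompt_len = enc.input_ids.shape[1]

# One REINFORCE step: sample a continuation, score it with the reward
# model, and scale the log-likelihood of the sampled tokens by the reward.
gen = policy.generate(**enc, do_sample=True, max_new_tokens=20,
                      pad_token_id=policy_tok.eos_token_id)
continuation = policy_tok.decode(gen[0, prompt_len:],
                                 skip_special_tokens=True)
r = reward([continuation])[0]

logits = policy(gen).logits[:, :-1]          # position t predicts token t+1
logprobs = torch.log_softmax(logits, dim=-1)
token_lp = logprobs.gather(-1, gen[:, 1:].unsqueeze(-1)).squeeze(-1)
loss = -(r * token_lp[:, prompt_len - 1:].sum())   # policy-gradient loss
loss.backward()
optimizer.step()

RL-based detoxification methods of this kind typically also add a KL penalty against the frozen pretrained model so that fluency is preserved while toxicity is reduced; the sketch omits this term for brevity.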
Pages: 8421 - 8435
Page count: 14
Related Papers
50 items in total
  • [31] Transformer-based models for combating rumours on microblogging platforms: a review
    Anggrainingsih, Rini
    Hassan, Ghulam Mubashar
    Datta, Amitava
    ARTIFICIAL INTELLIGENCE REVIEW, 2024, 57 (08)
  • [32] Transformer-based models to deal with heterogeneous environments in Human Activity Recognition
    Ek, S.
    Portet, F.
    Lalanda, P.
    PERSONAL AND UBIQUITOUS COMPUTING, 2023, 27 (06) : 2267 - 2280
  • [33] Transformer-Based Models for Probabilistic Time Series Forecasting with Explanatory Variables
    Caetano, Ricardo
    Oliveira, Jose Manuel
    Ramos, Patricia
    MATHEMATICS, 2025, 13 (05)
  • [34] Calibration of Transformer-Based Models for Identifying Stress and Depression in Social Media
    Ilias, Loukas
    Mouzakitis, Spiros
    Askounis, Dimitris
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024, 11 (02) : 1979 - 1990
  • [35] Benchmarking Inference of Transformer-Based Transcription Models With Clustering on Embedded GPUs
    Schubert, Marika E.
    Langerman, David
    George, Alan D.
    IEEE ACCESS, 2024, 12 : 123276 - 123293
  • [36] Evaluation of transformer-based models for punctuation and capitalization restoration in Catalan and Galician
    Pan, Ronghao
    Garcia-Diaz, Jose Antonio
    Vivancos-Vicente, Pedro Jose
    Valencia-Garcia, Rafael
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2023, (70) : 27 - 38
  • [37] End-to-End Transformer-Based Models in Textual-Based NLP
    Rahali, Abir
    Akhloufi, Moulay A.
    AI, 2023, 4 (01) : 54 - 110
  • [38] A bio-inspired positional embedding network for transformer-based models
    Tang, Xue-song
    Hao, Kuangrong
    Wei, Hui
    NEURAL NETWORKS, 2023, 166 : 204 - 214
  • [39] Scaling Implicit Bias Analysis across Transformer-Based Language Models through Embedding Association Test and Prompt Engineering
    Bevara, Ravi Varma Kumar
    Mannuru, Nishith Reddy
    Karedla, Sai Pranathi
    Xiao, Ting
    APPLIED SCIENCES-BASEL, 2024, 14 (08)
  • [40] Performance Comparison of Vision Transformer-Based Models in Medical Image Classification
    Kanca, Elif
    Ayas, Selen
    Kablan, Elif Baykal
    Ekinci, Murat
    2023 31ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU, 2023