Reward modeling for mitigating toxicity in transformer-based language models

Citations: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information Systems Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023 / Volume 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models generate fluent text and can be efficiently adapted to a wide range of natural language generation tasks. However, language models pretrained on large unlabeled web-text corpora have been shown to degenerate into toxic content and to exhibit social biases, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle when the model is conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models with a new reward model that detects toxic content while mitigating unintended bias toward social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics and is less prone to unintended bias toward social identities in the generated content.
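The abstract describes fine-tuning a language model with reinforcement learning against a reward model that scores generations for toxicity. The sketch below illustrates that general idea only: it uses a plain REINFORCE update and assumes a generic binary toxicity classifier (label index 1 = toxic) as the reward, with GPT-2 as a stand-in policy, a placeholder reward-model checkpoint path, and arbitrary hyperparameters. It is not the authors' Reinforce-Detoxify implementation or their bias-aware reward model.

```python
# Minimal sketch of RL-based detoxification with a toxicity reward (illustrative only).
# Checkpoint names, the reward definition, and hyperparameters are placeholder assumptions.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Policy: the language model being detoxified (GPT-2 used as a stand-in).
policy_tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Reward model: assumed to be a binary toxicity classifier (hypothetical checkpoint).
reward_tok = AutoTokenizer.from_pretrained("path/to/toxicity-reward-model")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/toxicity-reward-model").to(device)

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def toxicity_reward(texts):
    """Reward non-toxic text: 1 - P(toxic) under the classifier (index 1 assumed toxic)."""
    with torch.no_grad():
        enc = reward_tok(texts, return_tensors="pt", padding=True,
                         truncation=True).to(device)
        probs = reward_model(**enc).logits.softmax(dim=-1)
    return 1.0 - probs[:, 1]

prompts = ["The woman walked into the room and", "People from that country are"]
for prompt in prompts:
    enc = policy_tok(prompt, return_tensors="pt").to(device)
    prompt_len = enc["input_ids"].shape[1]

    # Sample a continuation from the current policy.
    out = policy.generate(**enc, do_sample=True, top_p=0.9, max_new_tokens=20,
                          pad_token_id=policy_tok.eos_token_id)
    text = policy_tok.decode(out[0], skip_special_tokens=True)

    # Log-probability of the sampled continuation under the policy.
    logits = policy(out).logits[:, :-1, :]
    token_logprobs = logits.log_softmax(dim=-1).gather(
        -1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
    gen_logprob = token_logprobs[:, prompt_len - 1:].sum()

    # REINFORCE step: raise the likelihood of continuations the reward model
    # scores as non-toxic, lower it otherwise.
    loss = -toxicity_reward([text])[0] * gen_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, RL fine-tuning of this kind is usually paired with a term that keeps the policy close to the pretrained model (for example a KL penalty), so that the model does not trade away fluency to satisfy the toxicity reward.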
Pages: 8421-8435
Number of pages: 14
Related papers
50 records in total
  • [21] Transformer-Based Federated Learning Models for Recommendation Systems
    Reddy, M. Sujaykumar
    Karnati, Hemanth
    Sundari, L. Mohana
    IEEE ACCESS, 2024, 12 : 109596 - 109607
  • [22] Smart Home Notifications in Croatian Language: A Transformer-Based Approach
    Simunec, Magdalena
    Soic, Renato
    2023 17TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS, CONTEL, 2023,
  • [23] Transformer-Based Single-Cell Language Model: A Survey
    Lan, Wei
    He, Guohang
    Liu, Mingyang
    Chen, Qingfeng
    Cao, Junyue
    Peng, Wei
    BIG DATA MINING AND ANALYTICS, 2024, 7 (04) : 1169 - 1186
  • [24] AI-Assisted Text Composition for Automated Content Authoring Using Transformer-Based Language Models
    Alpdemir, Yusuf
    Alpdemir, Mahmut Nedim
    2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED SYSTEMS AND EMERGENT TECHNOLOGIES, ICASET 2024, 2024,
  • [25] From Captions to Explanations: A Multimodal Transformer-based Architecture for Natural Language Explanation Generation
    Rio-Torto, Isabel
    Cardoso, Jaime S.
    Teixeira, Luis F.
    PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2022), 2022, 13256 : 54 - 65
  • [26] Transformer-Based Models for the Automatic Indexing of Scientific Documents in French
    Angel Gonzalez, Jose
    Buscaldi, Davide
    Sanchis, Emilio
    Hurtado, Lluis-F
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2022), 2022, 13286 : 60 - 72
  • [27] An Ensemble of Arabic Transformer-based Models for Arabic Sentiment Analysis
    El Karfi, Ikram
    El Fkihi, Sanaa
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (08) : 561 - 567
  • [28] Enriching Transformer-Based Embeddings for Emotion Identification in an Agglutinative Language: Turkish
    Uymaz, Hande Aka
    Metin, Senem Kumova
    IT PROFESSIONAL, 2023, 25 (04) : 67 - 73
  • [29] ToEx: Accelerating Generation Stage of Transformer-Based Language Models via Token-Adaptive Early Exit
    Kang, Myeonggu
    Park, Junyoung
    Shin, Hyein
    Shin, Jaekang
    Kim, Lee-Sup
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (09) : 2248 - 2261
  • [30] A Systematic Review of Transformer-Based Pre-Trained Language Models through Self-Supervised Learning
    Kotei, Evans
    Thirunavukarasu, Ramkumar
    INFORMATION, 2023, 14 (03)