Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0

Authors:

Farshid Faal

Ketra Schmitt

Jia Yuan Yu

Affiliations:

[1] Concordia University, Concordia Institute for Information Systems Engineering

[2] Concordia University, Centre for Engineering in Society

Source:

Applied Intelligence | 2023 / Vol. 53

Keywords:

Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation

DOI:

Not available

Abstract:

Transformer-based language models can generate fluent text and be efficiently adapted to a variety of natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and exhibit social bias, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that contain specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias towards social identities in toxicity prediction. The experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach is less prone to unintended bias toward social identities in generated content.

Pages: 8421-8435

Page count: 14
