Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information Systems Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to exhibit social biases, hindering their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when they are conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias toward social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics and is less prone to unintended bias toward social identities in generated content.
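The abstract describes the core loop of reward-model-guided detoxification: a reward model scores sampled continuations for toxicity, and the language model is fine-tuned with policy-gradient reinforcement learning to maximize that reward. The following is a minimal sketch of that loop, assuming a GPT-2 policy and the public unitary/toxic-bert classifier standing in for the paper's own bias-aware reward model; the model names, the single REINFORCE step, the example prompt, and all hyperparameters are illustrative assumptions, not the authors' exact setup.

import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Policy to detoxify: an ordinary causal LM (GPT-2 as a stand-in).
policy_tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

# Stand-in reward model: a public toxicity classifier, assuming its
# first logit scores the "toxic" label (the paper trains its own
# bias-aware reward model instead).
reward_tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "unitary/toxic-bert").to(device).eval()

def reward(texts):
    # Higher reward for less toxic text: negate the toxicity probability.
    with torch.no_grad():
        batch = reward_tok(texts, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        return -torch.sigmoid(reward_model(**batch).logits[:, 0])

prompt = "The people who live there are"
enc = policy_tok(prompt, return_tensors="pt").to(device)
prompt_len = enc.input_ids.shape[1]

# One REINFORCE step: sample a continuation, score it with the reward
# model, and scale the log-likelihood of the sampled tokens by the reward.
gen = policy.generate(**enc, do_sample=True, max_new_tokens=20,
                      pad_token_id=policy_tok.eos_token_id)
continuation = policy_tok.decode(gen[0, prompt_len:],
                                 skip_special_tokens=True)
r = reward([continuation])[0]

logits = policy(gen).logits[:, :-1]          # position t predicts token t+1
logprobs = torch.log_softmax(logits, dim=-1)
token_lp = logprobs.gather(-1, gen[:, 1:].unsqueeze(-1)).squeeze(-1)
loss = -(r * token_lp[:, prompt_len - 1:].sum())   # policy-gradient loss
loss.backward()
optimizer.step()

RL-based detoxification methods of this kind typically also add a KL penalty against the frozen pretrained model so that fluency is preserved while toxicity is reduced; the sketch omits this term for brevity.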
Pages: 8421 - 8435
Page count: 14
Related Papers
50 items in total
  • [31] Transformer-based models for combating rumours on microblogging platforms: a review
    Anggrainingsih, Rini
    Hassan, Ghulam Mubashar
    Datta, Amitava
    ARTIFICIAL INTELLIGENCE REVIEW, 2024, 57 (08)
  • [32] Transformer-based models to deal with heterogeneous environments in Human Activity Recognition
    Ek, S.
    Portet, F.
    Lalanda, P.
    PERSONAL AND UBIQUITOUS COMPUTING, 2023, 27 (06) : 2267 - 2280
  • [33] Transformer-Based Models for Probabilistic Time Series Forecasting with Explanatory Variables
    Caetano, Ricardo
    Oliveira, Jose Manuel
    Ramos, Patricia
    MATHEMATICS, 2025, 13 (05)
  • [34] Calibration of Transformer-Based Models for Identifying Stress and Depression in Social Media
    Ilias, Loukas
    Mouzakitis, Spiros
    Askounis, Dimitris
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024, 11 (02) : 1979 - 1990
  • [35] Benchmarking Inference of Transformer-Based Transcription Models With Clustering on Embedded GPUs
    Schubert, Marika E.
    Langerman, David
    George, Alan D.
    IEEE ACCESS, 2024, 12 : 123276 - 123293
  • [36] Evaluation of transformer-based models for punctuation and capitalization restoration in Catalan and Galician
    Pan, Ronghao
    Garcia-Diaz, Jose Antonio
    Vivancos-Vicente, Pedro Jose
    Valencia-Garcia, Rafael
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2023, (70) : 27 - 38
  • [37] End-to-End Transformer-Based Models in Textual-Based NLP
    Rahali, Abir
    Akhloufi, Moulay A.
    AI, 2023, 4 (01) : 54 - 110
  • [38] A bio-inspired positional embedding network for transformer-based models
    Tang, Xue-song
    Hao, Kuangrong
    Wei, Hui
    NEURAL NETWORKS, 2023, 166 : 204 - 214
  • [39] Scaling Implicit Bias Analysis across Transformer-Based Language Models through Embedding Association Test and Prompt Engineering
    Bevara, Ravi Varma Kumar
    Mannuru, Nishith Reddy
    Karedla, Sai Pranathi
    Xiao, Ting
    APPLIED SCIENCES-BASEL, 2024, 14 (08)
  • [40] Performance Comparison of Vision Transformer-Based Models in Medical Image Classification
    Kanca, Elif
    Ayas, Selen
    Kablan, Elif Baykal
    Ekinci, Murat
    2023 31ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU, 2023