Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information System Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023 / Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and can be efficiently adapted to a variety of natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to exhibit social bias, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that contain specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias towards social identities in toxicity prediction. The experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that the approach is less prone to unintended bias toward social identities in generated content.
Pages: 8421-8435
Number of pages: 14