Reward modeling for mitigating toxicity in transformer-based language models

Citations: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information Systems Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023 / Volume 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models generate fluent text and can be efficiently adapted to a wide range of natural language generation tasks. However, language models pretrained on large unlabeled web-text corpora have been shown to degenerate into toxic content and to exhibit social biases, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle when the model is conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models with a new reward model that detects toxic content while mitigating unintended bias toward social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics and is less prone to unintended bias toward social identities in the generated content.
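The abstract describes fine-tuning a language model with reinforcement learning against a reward model that scores generations for toxicity. The sketch below illustrates that general idea only: it uses a plain REINFORCE update and assumes a generic binary toxicity classifier (label index 1 = toxic) as the reward, with GPT-2 as a stand-in policy, a placeholder reward-model checkpoint path, and arbitrary hyperparameters. It is not the authors' Reinforce-Detoxify implementation or their bias-aware reward model.

```python
# Minimal sketch of RL-based detoxification with a toxicity reward (illustrative only).
# Checkpoint names, the reward definition, and hyperparameters are placeholder assumptions.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Policy: the language model being detoxified (GPT-2 used as a stand-in).
policy_tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Reward model: assumed to be a binary toxicity classifier (hypothetical checkpoint).
reward_tok = AutoTokenizer.from_pretrained("path/to/toxicity-reward-model")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/toxicity-reward-model").to(device)

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def toxicity_reward(texts):
    """Reward non-toxic text: 1 - P(toxic) under the classifier (index 1 assumed toxic)."""
    with torch.no_grad():
        enc = reward_tok(texts, return_tensors="pt", padding=True,
                         truncation=True).to(device)
        probs = reward_model(**enc).logits.softmax(dim=-1)
    return 1.0 - probs[:, 1]

prompts = ["The woman walked into the room and", "People from that country are"]
for prompt in prompts:
    enc = policy_tok(prompt, return_tensors="pt").to(device)
    prompt_len = enc["input_ids"].shape[1]

    # Sample a continuation from the current policy.
    out = policy.generate(**enc, do_sample=True, top_p=0.9, max_new_tokens=20,
                          pad_token_id=policy_tok.eos_token_id)
    text = policy_tok.decode(out[0], skip_special_tokens=True)

    # Log-probability of the sampled continuation under the policy.
    logits = policy(out).logits[:, :-1, :]
    token_logprobs = logits.log_softmax(dim=-1).gather(
        -1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
    gen_logprob = token_logprobs[:, prompt_len - 1:].sum()

    # REINFORCE step: raise the likelihood of continuations the reward model
    # scores as non-toxic, lower it otherwise.
    loss = -toxicity_reward([text])[0] * gen_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, RL fine-tuning of this kind is usually paired with a term that keeps the policy close to the pretrained model (for example a KL penalty), so that the model does not trade away fluency to satisfy the toxicity reward.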
Pages: 8421-8435
Number of pages: 14
Related papers
50 records in total
  • [21] Transformer-Based Federated Learning Models for Recommendation Systems
    Reddy, M. Sujaykumar
    Karnati, Hemanth
    Sundari, L. Mohana
    IEEE ACCESS, 2024, 12 : 109596 - 109607
  • [22] Smart Home Notifications in Croatian Language: A Transformer-Based Approach
    Simunec, Magdalena
    Soic, Renato
    2023 17TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS, CONTEL, 2023,
  • [23] Transformer-Based Single-Cell Language Model: A Survey
    Lan, Wei
    He, Guohang
    Liu, Mingyang
    Chen, Qingfeng
    Cao, Junyue
    Peng, Wei
    BIG DATA MINING AND ANALYTICS, 2024, 7 (04) : 1169 - 1186
  • [24] AI-Assisted Text Composition for Automated Content Authoring Using Transformer-Based Language Models
    Alpdemir, Yusuf
    Alpdemir, Mahmut Nedim
    2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED SYSTEMS AND EMERGENT TECHNOLOGIES, ICASET 2024, 2024,
  • [25] From Captions to Explanations: A Multimodal Transformer-based Architecture for Natural Language Explanation Generation
    Rio-Torto, Isabel
    Cardoso, Jaime S.
    Teixeira, Luis F.
    PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2022), 2022, 13256 : 54 - 65
  • [26] Transformer-Based Models for the Automatic Indexing of Scientific Documents in French
    Angel Gonzalez, Jose
    Buscaldi, Davide
    Sanchis, Emilio
    Hurtado, Lluis-F
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2022), 2022, 13286 : 60 - 72
  • [27] An Ensemble of Arabic Transformer-based Models for Arabic Sentiment Analysis
    El Karfi, Ikram
    El Fkihi, Sanaa
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (08) : 561 - 567
  • [28] Enriching Transformer-Based Embeddings for Emotion Identification in an Agglutinative Language: Turkish
    Uymaz, Hande Aka
    Metin, Senem Kumova
    IT PROFESSIONAL, 2023, 25 (04) : 67 - 73
  • [29] ToEx: Accelerating Generation Stage of Transformer-Based Language Models via Token-Adaptive Early Exit
    Kang, Myeonggu
    Park, Junyoung
    Shin, Hyein
    Shin, Jaekang
    Kim, Lee-Sup
    IEEE TRANSACTIONS ON COMPUTERS, 2024, 73 (09) : 2248 - 2261
  • [30] A Systematic Review of Transformer-Based Pre-Trained Language Models through Self-Supervised Learning
    Kotei, Evans
    Thirunavukarasu, Ramkumar
    INFORMATION, 2023, 14 (03)