Detoxifying FLAN-T5 with RLHF (PPO + Hate-Speech Reward Model)

Generative AI, Responsible AI

Overview

This project applies Reinforcement Learning from Human Feedback (RLHF) to an instruction-fine-tuned FLAN-T5 dialogue summarization model.

Starting from a summarization model I trained in a previous lab, I:

  • Attached a hate-speech classifier as a reward model.
  • Used PPO (Proximal Policy Optimization) to optimize for “not hate”.
  • Constrained learning with KL-divergence against a frozen reference model to prevent reward hacking.
  • Evaluated the model’s toxicity and summarization quality before and after RLHF.

The goal: reduce toxicity in generated summaries while preserving relevance and fluency.


Objectives

  • Take an instruction-tuned FLAN-T5 summarization model (dialogue → summary).
  • Use RLHF to lower the toxicity of the generated summaries.
  • Keep the model close to the original via KL-regularization and a frozen reference model.
  • Quantitatively measure toxicity scores and qualitatively compare responses.

Tech Stack

  • Model & Libraries

    • FLAN-T5 (seq2seq LLM)
    • Hugging Face transformers, datasets, trl
    • PyTorch for training
    • PEFT / LoRA for parameter-efficient fine-tuning
    • evaluate (toxicity + ROUGE)
    • tqdm for progress tracking
  • Reward Model

    • Facebook RoBERTa hate-speech classifier
    • Loaded via AutoModelForSequenceClassification
    • Binary labels: not_hate (index 0), hate (index 1)
  • Environment

    • Jupyter notebook on AWS (SageMaker lab environment)
    • GPU-enabled instance

Approach

1. Starting point: Instruction-tuned policy model

  • Loaded the Lab 2 FLAN-T5 checkpoint:
    • Instruction-tuned for dialogue summarization.
    • Fine-tuned previously using LoRA/PEFT, training only ~1.4% of parameters.
  • Wrapped the model with TRL’s AutoModelForSeq2SeqLMWithValueHead so PPO has the value head it needs for advantage estimation.
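
A minimal sketch of that wrapping step, assuming the pre-1.0 TRL PPO API; google/flan-t5-base and the adapter path are placeholders for the actual Lab 2 artifacts:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead

model_name = "google/flan-t5-base"           # base checkpoint (assumed)
peft_checkpoint = "./peft-dialogue-summary"  # placeholder path for the Lab 2 LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Re-attach the Lab 2 LoRA adapter to the base model and keep the adapter trainable.
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base_model, peft_checkpoint, is_trainable=True)

# Wrap the policy with a value head so PPO can estimate per-token values.
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
    peft_model, torch_dtype=torch.bfloat16, is_trainable=True
)
```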

2. Dataset preparation

  • Loaded a dialogue dataset from datasets.
  • Built a helper function build_dataset to:
    • Sample prompt lengths with LengthSampler (keeping inputs within the 512-token window).
    • Tokenize with the correct FLAN-T5 tokenizer.
    • Wrap each dialogue in an instruction prompt (same format as Labs 1–2).
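
A sketch of build_dataset under a few assumptions: the dialogue data is knkarthick/dialogsum, the prompt template mirrors the Labs 1–2 instruction, and the length bounds are illustrative:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl.core import LengthSampler

def build_dataset(model_name="google/flan-t5-base",
                  dataset_name="knkarthick/dialogsum",   # assumed dialogue dataset
                  input_min_tokens=200, input_max_tokens=512):
    """Wrap each dialogue in the summarization instruction prompt and tokenize it."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name, split="train")

    # Per-example input length drawn by LengthSampler, never exceeding the 512-token window.
    input_size = LengthSampler(input_min_tokens, input_max_tokens)

    def tokenize(sample):
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)[: input_size()]
        # PPOTrainer expects the decoded prompt under the "query" key.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    return dataset
```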

3. Reward model: Hate-speech classifier

  • Loaded Facebook’s RoBERTa hate-speech model as a binary classifier:
    • Input: (prompt + generated summary) concatenated.
    • Output logits → probability of not_hate vs hate.
  • Carefully fixed the index mapping:
    • not_hate_index = 0
    • hate_index = 1
  • Constructed the reward signal from the not_hate logit:
    • Higher not_hate → higher reward.
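
A sketch of the reward scoring, assuming the facebook/roberta-hate-speech-dynabench-r4-target checkpoint; compute_reward is an illustrative helper name:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"  # assumed checkpoint
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

not_hate_index = 0  # label order: [not_hate, hate]

def compute_reward(text: str) -> torch.Tensor:
    """Score a (prompt + summary) string; a higher not_hate logit means a higher reward."""
    inputs = toxicity_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = toxicity_model(**inputs).logits  # shape: (1, 2) -> [not_hate, hate]
    return logits[0, not_hate_index]
```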

4. Reference model & KL divergence

  • Created a reference model (frozen) using TRL:
    • Same weights as the original instruction-tuned FLAN-T5.
    • Never updated during PPO.
  • PPO loss includes a KL-divergence term:
    • Encourages the updated policy to stay close to the reference.
    • Prevents reward hacking (e.g., weird text that is non-toxic but irrelevant).
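
TRL’s create_reference_model covers this step; a short sketch reusing ppo_model from the snippet above:

```python
from trl import create_reference_model

# Frozen copy of the policy; PPO computes its KL penalty against this model's distributions.
ref_model = create_reference_model(ppo_model)

# Sanity check: the reference copy should expose no trainable parameters.
assert sum(p.numel() for p in ref_model.parameters() if p.requires_grad) == 0
```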

5. PPO training loop (RLHF)

For each batch:

  1. Use the policy model to generate summaries for dialogue prompts.
  2. Concatenate prompt + summary and score with the hate-speech reward model.
  3. Extract the not_hate logit as the reward.
  4. Feed (query, response, reward) into PPOTrainer.
  5. PPO updates only the LoRA adapter weights (≈1.4% of total parameters).

Training configuration highlights:

  • Small number of PPO epochs (for lab runtime).
  • batch_size = 16
  • Monitored:
    • KL-divergence (should remain bounded)
    • Mean reward / advantage (should increase)
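
A condensed sketch of the configuration and the per-batch loop, assuming the pre-1.0 TRL PPO API (PPOConfig, PPOTrainer, ppo_trainer.step) and reusing ppo_model, ref_model, tokenizer, dataset, and compute_reward from the snippets above; learning rate, mini-batch size, and generation settings are illustrative:

```python
from tqdm import tqdm
from trl import PPOConfig, PPOTrainer
from trl.core import LengthSampler

config = PPOConfig(
    model_name="google/flan-t5-base",  # illustrative
    learning_rate=1.41e-5,             # illustrative
    ppo_epochs=1,                      # kept small for lab runtime
    mini_batch_size=4,                 # illustrative
    batch_size=16,
)

def collator(data):
    # Turn a list of dataset rows into the dict-of-lists format PPOTrainer expects.
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,      # policy with value head; only the LoRA adapter is trainable
    ref_model=ref_model,  # frozen reference used for the KL penalty
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
)

output_length_sampler = LengthSampler(100, 200)  # summary length range (illustrative)
generation_kwargs = {"min_length": 5, "top_p": 1.0, "do_sample": True}

max_ppo_steps = 10  # kept small for lab runtime
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # 1. Generate a summary for each dialogue prompt with the current policy.
    summary_tensors = []
    for prompt in prompt_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        summary = ppo_trainer.generate(prompt, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(s, skip_special_tokens=True) for s in summary_tensors]

    # 2-3. Score (prompt + summary) with the hate-speech model; the not_hate logit is the reward.
    reward_tensors = [
        compute_reward(q + r) for q, r in zip(batch["query"], batch["response"])
    ]

    # 4-5. PPO update on the LoRA adapter weights; monitor KL and mean return.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    print(f'objective/kl: {stats["objective/kl"]:.2f}  ppo/returns/mean: {stats["ppo/returns/mean"]:.2f}')
```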

Evaluation

Toxicity evaluation

  • Used the evaluate library’s toxicity metric.
  • Built a helper function to:
    • Sample multiple dialogues,
    • Generate summaries,
    • Compute mean toxicity and standard deviation.
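
A sketch of that helper, pointing the evaluate toxicity measurement at the same hate-speech checkpoint (toxicity_model_name from the reward-model snippet); evaluate_toxicity and the sampling settings are illustrative:

```python
import numpy as np
import evaluate
from transformers import GenerationConfig

# Point the toxicity measurement at the same hate-speech checkpoint used for the reward.
toxicity_metric = evaluate.load(
    "toxicity", toxicity_model_name, module_type="measurement", toxic_label="hate"
)

def evaluate_toxicity(model, tokenizer, dataset, num_samples=10):
    """Generate summaries for a sample of dialogues; return mean and std toxicity."""
    toxicities = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        input_ids = sample["input_ids"].unsqueeze(0)
        output_ids = model.generate(
            input_ids=input_ids,
            generation_config=GenerationConfig(max_new_tokens=100, top_p=1.0, do_sample=True),
        )
        summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        scores = toxicity_metric.compute(predictions=[sample["query"] + summary])
        toxicities.extend(scores["toxicity"])
    return np.mean(toxicities), np.std(toxicities)
```

Calling this once with the reference model and once with the PPO-tuned model gives the before/after numbers compared below.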

Compared:

  • Before PPO (baseline FLAN-T5):
    • Recorded the baseline mean toxicity and standard deviation.
  • After PPO (RLHF-tuned FLAN-T5):
    • Mean toxicity decreased.
    • KL divergence remained within reasonable bounds.

Qualitative comparison

  • Displayed side-by-side examples:
    • Original prompt (dialogue + instruction).
    • Before RLHF summary.
    • After RLHF summary.
    • Reward scores for each.
  • Observed that:
    • After RLHF, summaries were less toxic / less aggressive in tone.
    • Content remained on topic and coherent, thanks to KL regularization.
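
One illustrative way to build that side-by-side table with pandas, reusing ref_model (before RLHF), ppo_model (after RLHF), and compute_reward from the snippets above:

```python
import pandas as pd
import torch
from transformers import GenerationConfig

gen_config = GenerationConfig(max_new_tokens=100, top_p=1.0, do_sample=True)

def summarize(model, input_ids):
    # Generate a summary for a single tokenized prompt.
    with torch.no_grad():
        out = model.generate(input_ids=input_ids.unsqueeze(0), generation_config=gen_config)
    return tokenizer.decode(out[0], skip_special_tokens=True)

rows = []
for sample in dataset.select(range(5)):
    before = summarize(ref_model, sample["input_ids"])  # frozen reference = before RLHF
    after = summarize(ppo_model, sample["input_ids"])   # PPO-tuned policy = after RLHF
    rows.append({
        "query": sample["query"],
        "response_before": before,
        "response_after": after,
        "reward_before": compute_reward(sample["query"] + before).item(),
        "reward_after": compute_reward(sample["query"] + after).item(),
    })

comparison_df = pd.DataFrame(rows)
```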

What I Learned

  • How to implement a full RLHF loop using:
    • Policy model (FLAN-T5),
    • Reward model (hate-speech classifier),
    • Reference model + KL divergence,
    • PPOTrainer from TRL.
  • The importance of label/index mapping in reward models (getting the not_hate index right).
  • How PEFT/LoRA allows RLHF customization while only training ~1–2% of parameters.
  • Practical trade-offs between toxicity reduction and fidelity to the original model.
  • How to combine transformers, TRL, and evaluate into a reproducible RLHF experiment.