Detoxifying FLAN-T5 with RLHF (PPO + Hate-Speech Reward Model)

Generative AI, Responsible AI

Overview

This project applies Reinforcement Learning from Human Feedback (RLHF) to an instruction-fine-tuned FLAN-T5 dialogue summarization model.

Starting from a summarization model I trained in a previous lab, I:

  • Attached a hate-speech classifier as a reward model.
  • Used PPO (Proximal Policy Optimization) to optimize for “not hate”.
  • Constrained learning with KL-divergence against a frozen reference model to prevent reward hacking.
  • Evaluated the model’s toxicity and summarization quality before and after RLHF.

The goal: reduce toxicity in generated summaries while preserving relevance and fluency.


Objectives

  • Take an instruction-tuned FLAN-T5 summarization model (dialogue → summary).
  • Use RLHF to lower the toxicity of the generated summaries.
  • Keep the model close to the original via KL-regularization and a frozen reference model.
  • Quantitatively measure toxicity scores and qualitatively compare responses.

Tech Stack

  • Model & Libraries

    • FLAN-T5 (seq2seq LLM)
    • Hugging Face transformers, datasets, trl
    • PyTorch for training
    • PEFT / LoRA for parameter-efficient fine-tuning
    • evaluate (toxicity + ROUGE)
    • tqdm for progress tracking
  • Reward Model

    • Facebook RoBERTa hate-speech classifier
    • Loaded via AutoModelForSequenceClassification
    • Binary labels: not_hate (index 0), hate (index 1)
  • Environment

    • Jupyter notebook on AWS (SageMaker lab environment)
    • GPU-enabled instance

Approach

1. Starting point: Instruction-tuned policy model

  • Loaded the Lab 2 FLAN-T5 checkpoint:
    • Instruction-tuned for dialogue summarization.
    • Fine-tuned previously using LoRA/PEFT, training only ~1.4% of parameters.
  • Wrapped the model with TRL’s AutoModelForSeq2SeqLMWithValueHead so PPO has the value head it needs for advantage estimation.
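
A minimal sketch of that wrapping step, assuming the pre-1.0 TRL PPO API; google/flan-t5-base and the adapter path are placeholders for the actual Lab 2 artifacts:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead

model_name = "google/flan-t5-base"           # base checkpoint (assumed)
peft_checkpoint = "./peft-dialogue-summary"  # placeholder path for the Lab 2 LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Re-attach the Lab 2 LoRA adapter to the base model and keep the adapter trainable.
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base_model, peft_checkpoint, is_trainable=True)

# Wrap the policy with a value head so PPO can estimate per-token values.
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
    peft_model, torch_dtype=torch.bfloat16, is_trainable=True
)
```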

2. Dataset preparation

  • Loaded a dialogue dataset from datasets.
  • Built a helper function build_dataset to:
    • Sample prompt lengths with LengthSampler (keeping inputs within the 512-token window).
    • Tokenize with the correct FLAN-T5 tokenizer.
    • Wrap each dialogue in an instruction prompt (same format as Labs 1–2).
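
A sketch of build_dataset under a few assumptions: the dialogue data is knkarthick/dialogsum, the prompt template mirrors the Labs 1–2 instruction, and the length bounds are illustrative:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl.core import LengthSampler

def build_dataset(model_name="google/flan-t5-base",
                  dataset_name="knkarthick/dialogsum",   # assumed dialogue dataset
                  input_min_tokens=200, input_max_tokens=512):
    """Wrap each dialogue in the summarization instruction prompt and tokenize it."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name, split="train")

    # Per-example input length drawn by LengthSampler, never exceeding the 512-token window.
    input_size = LengthSampler(input_min_tokens, input_max_tokens)

    def tokenize(sample):
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)[: input_size()]
        # PPOTrainer expects the decoded prompt under the "query" key.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    return dataset
```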

3. Reward model: Hate-speech classifier

  • Loaded Facebook’s RoBERTa hate-speech model as a binary classifier:
    • Input: (prompt + generated summary) concatenated.
    • Output logits → probability of not_hate vs hate.
  • Carefully fixed the index mapping:
    • not_hate_index = 0
    • hate_index = 1
  • Constructed the reward signal from the not_hate logit:
    • Higher not_hate → higher reward.
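
A sketch of the reward scoring, assuming the facebook/roberta-hate-speech-dynabench-r4-target checkpoint; compute_reward is an illustrative helper name:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"  # assumed checkpoint
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

not_hate_index = 0  # label order: [not_hate, hate]

def compute_reward(text: str) -> torch.Tensor:
    """Score a (prompt + summary) string; a higher not_hate logit means a higher reward."""
    inputs = toxicity_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = toxicity_model(**inputs).logits  # shape: (1, 2) -> [not_hate, hate]
    return logits[0, not_hate_index]
```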

4. Reference model & KL divergence

  • Created a reference model (frozen) using TRL:
    • Same weights as the original instruction-tuned FLAN-T5.
    • Never updated during PPO.
  • PPO loss includes a KL-divergence term:
    • Encourages the updated policy to stay close to the reference.
    • Prevents reward hacking (e.g., weird text that is non-toxic but irrelevant).
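
TRL’s create_reference_model covers this step; a short sketch reusing ppo_model from the snippet above:

```python
from trl import create_reference_model

# Frozen copy of the policy; PPO computes its KL penalty against this model's distributions.
ref_model = create_reference_model(ppo_model)

# Sanity check: the reference copy should expose no trainable parameters.
assert sum(p.numel() for p in ref_model.parameters() if p.requires_grad) == 0
```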

5. PPO training loop (RLHF)

For each batch:

  1. Use the policy model to generate summaries for dialogue prompts.
  2. Concatenate prompt + summary and score with the hate-speech reward model.
  3. Extract the not_hate logit as the reward.
  4. Feed (query, response, reward) into PPOTrainer.
  5. PPO updates only the LoRA adapter weights (≈1.4% of total parameters).

Training configuration highlights:

  • Small number of PPO epochs (for lab runtime).
  • batch_size = 16
  • Monitored:
    • KL-divergence (should remain bounded)
    • Mean reward / advantage (should increase)
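
A condensed sketch of the configuration and the per-batch loop, assuming the pre-1.0 TRL PPO API (PPOConfig, PPOTrainer, ppo_trainer.step) and reusing ppo_model, ref_model, tokenizer, dataset, and compute_reward from the snippets above; learning rate, mini-batch size, and generation settings are illustrative:

```python
from tqdm import tqdm
from trl import PPOConfig, PPOTrainer
from trl.core import LengthSampler

config = PPOConfig(
    model_name="google/flan-t5-base",  # illustrative
    learning_rate=1.41e-5,             # illustrative
    ppo_epochs=1,                      # kept small for lab runtime
    mini_batch_size=4,                 # illustrative
    batch_size=16,
)

def collator(data):
    # Turn a list of dataset rows into the dict-of-lists format PPOTrainer expects.
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,      # policy with value head; only the LoRA adapter is trainable
    ref_model=ref_model,  # frozen reference used for the KL penalty
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
)

output_length_sampler = LengthSampler(100, 200)  # summary length range (illustrative)
generation_kwargs = {"min_length": 5, "top_p": 1.0, "do_sample": True}

max_ppo_steps = 10  # kept small for lab runtime
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # 1. Generate a summary for each dialogue prompt with the current policy.
    summary_tensors = []
    for prompt in prompt_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        summary = ppo_trainer.generate(prompt, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(s, skip_special_tokens=True) for s in summary_tensors]

    # 2-3. Score (prompt + summary) with the hate-speech model; the not_hate logit is the reward.
    reward_tensors = [
        compute_reward(q + r) for q, r in zip(batch["query"], batch["response"])
    ]

    # 4-5. PPO update on the LoRA adapter weights; monitor KL and mean return.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    print(f'objective/kl: {stats["objective/kl"]:.2f}  ppo/returns/mean: {stats["ppo/returns/mean"]:.2f}')
```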

Evaluation

Toxicity evaluation

  • Used the evaluate library’s toxicity metric.
  • Built a helper function to:
    • Sample multiple dialogues,
    • Generate summaries,
    • Compute mean toxicity and standard deviation.
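
A sketch of that helper, pointing the evaluate toxicity measurement at the same hate-speech checkpoint (toxicity_model_name from the reward-model snippet); evaluate_toxicity and the sampling settings are illustrative:

```python
import numpy as np
import evaluate
from transformers import GenerationConfig

# Point the toxicity measurement at the same hate-speech checkpoint used for the reward.
toxicity_metric = evaluate.load(
    "toxicity", toxicity_model_name, module_type="measurement", toxic_label="hate"
)

def evaluate_toxicity(model, tokenizer, dataset, num_samples=10):
    """Generate summaries for a sample of dialogues; return mean and std toxicity."""
    toxicities = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        input_ids = sample["input_ids"].unsqueeze(0)
        output_ids = model.generate(
            input_ids=input_ids,
            generation_config=GenerationConfig(max_new_tokens=100, top_p=1.0, do_sample=True),
        )
        summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        scores = toxicity_metric.compute(predictions=[sample["query"] + summary])
        toxicities.extend(scores["toxicity"])
    return np.mean(toxicities), np.std(toxicities)
```

Calling this once with the reference model and once with the PPO-tuned model gives the before/after numbers compared below.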

Compared:

  • Before PPO (baseline FLAN-T5):
    • Recorded the baseline mean toxicity and standard deviation.
  • After PPO (RLHF-tuned FLAN-T5):
    • Mean toxicity decreased.
    • KL divergence remained within reasonable bounds.

Qualitative comparison

  • Displayed side-by-side examples:
    • Original prompt (dialogue + instruction).
    • Before RLHF summary.
    • After RLHF summary.
    • Reward scores for each.
  • Observed that:
    • After RLHF, summaries were less toxic / less aggressive in tone.
    • Content remained on topic and coherent, thanks to KL regularization.
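
One illustrative way to build that side-by-side table with pandas, reusing ref_model (before RLHF), ppo_model (after RLHF), and compute_reward from the snippets above:

```python
import pandas as pd
import torch
from transformers import GenerationConfig

gen_config = GenerationConfig(max_new_tokens=100, top_p=1.0, do_sample=True)

def summarize(model, input_ids):
    # Generate a summary for a single tokenized prompt.
    with torch.no_grad():
        out = model.generate(input_ids=input_ids.unsqueeze(0), generation_config=gen_config)
    return tokenizer.decode(out[0], skip_special_tokens=True)

rows = []
for sample in dataset.select(range(5)):
    before = summarize(ref_model, sample["input_ids"])  # frozen reference = before RLHF
    after = summarize(ppo_model, sample["input_ids"])   # PPO-tuned policy = after RLHF
    rows.append({
        "query": sample["query"],
        "response_before": before,
        "response_after": after,
        "reward_before": compute_reward(sample["query"] + before).item(),
        "reward_after": compute_reward(sample["query"] + after).item(),
    })

comparison_df = pd.DataFrame(rows)
```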

What I Learned

  • How to implement a full RLHF loop using:
    • Policy model (FLAN-T5),
    • Reward model (hate-speech classifier),
    • Reference model + KL divergence,
    • PPOTrainer from TRL.
  • The importance of label/index mapping in reward models (getting the not_hate index right).
  • How PEFT/LoRA allows RLHF customization while only training ~1–2% of parameters.
  • Practical trade-offs between toxicity reduction and fidelity to the original model.
  • How to combine transformers, TRL, and evaluate into a reproducible RLHF experiment.