Overview
This project applies Reinforcement Learning from Human Feedback (RLHF) to an instruction-fine-tuned FLAN-T5 dialogue summarization model.
Starting from a summarization model I trained in a previous lab, I:
- Attached a hate-speech classifier as a reward model.
- Used PPO (Proximal Policy Optimization) to optimize for “not hate”.
- Constrained learning with KL-divergence against a frozen reference model to prevent reward hacking.
- Evaluated the model’s toxicity and summarization quality before and after RLHF.
The goal: reduce toxicity in generated summaries while preserving relevance and fluency.
Objectives
- Take an instruction-tuned FLAN-T5 summarization model (dialogue → summary).
- Use RLHF to lower the toxicity of the generated summaries.
- Keep the model close to the original via KL-regularization and a frozen reference model.
- Quantitatively measure toxicity scores and qualitatively compare responses.
Tech Stack
- Model & Libraries
  - FLAN-T5 (seq2seq LLM)
  - Hugging Face `transformers`, `datasets`, `trl`
  - PyTorch for training
  - PEFT / LoRA for parameter-efficient fine-tuning
  - `evaluate` (toxicity + ROUGE)
  - `tqdm` for progress tracking
- Reward Model
  - Facebook RoBERTa hate-speech classifier
  - Loaded via `AutoModelForSequenceClassification`
  - Binary labels: `not_hate` (index 0), `hate` (index 1)
- Environment
  - Jupyter notebook on AWS (SageMaker lab environment)
  - GPU-enabled instance
Approach
1. Starting point: Instruction-tuned policy model
- Loaded the Lab 2 FLAN-T5 checkpoint:
  - Instruction-tuned for dialogue summarization.
  - Fine-tuned previously using LoRA/PEFT, training only ~1.4% of parameters.
- Wrapped the model using `AutoModelForSeq2SeqLMWithValueHead` from TRL to support PPO's value head (see the sketch below).
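A minimal sketch of this step, assuming the base checkpoint is `google/flan-t5-base`, the Lab 2 adapters live at a hypothetical local path, and the TRL release used in the lab (later TRL versions dropped this value-head wrapper):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM
from trl import AutoModelForSeq2SeqLMWithValueHead

model_name = "google/flan-t5-base"  # assumed base checkpoint

# Rebuild the Lab 2 policy: base FLAN-T5 weights + the LoRA adapters trained previously.
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(
    base_model,
    "./peft-dialogue-summary-checkpoint",  # hypothetical path to the Lab 2 adapters
    torch_dtype=torch.bfloat16,
    is_trainable=True,
)

# Wrap with a value head so PPO can estimate values during training.
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
    peft_model,
    torch_dtype=torch.bfloat16,
    is_trainable=True,
)
```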
2. Dataset preparation
- Loaded a dialogue dataset from `datasets`.
- Built a helper function `build_dataset` (see the sketch below) to:
  - Sample text with `LengthSampler` (respecting a 512-token window).
  - Tokenize with the correct FLAN-T5 tokenizer.
  - Wrap each dialogue in an instruction prompt (same format as Labs 1–2).
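A sketch of the helper, assuming the dialogue dataset is DialogSum (`knkarthick/dialogsum`) and treating the length bounds as illustrative:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl.core import LengthSampler

def build_dataset(model_name="google/flan-t5-base",
                  dataset_name="knkarthick/dialogsum",  # assumed dialogue dataset
                  input_min_length=200,
                  input_max_length=512):
    """Tokenize dialogues into instruction prompts for PPO (a sketch, not the exact lab code)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name, split="train")

    # Sample a per-example input length inside the 512-token window.
    input_size = LengthSampler(input_min_length, input_max_length)

    def tokenize(sample):
        # Same instruction format as Labs 1-2.
        prompt = f"Summarize the following conversation.\n\n{sample['dialogue']}\n\nSummary:\n"
        sample["input_ids"] = tokenizer.encode(prompt)[: input_size()]
        # PPOTrainer also wants the decoded query string for logging and reward scoring.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    dataset = dataset.map(tokenize)
    dataset.set_format(type="torch")
    return dataset

dataset = build_dataset()
```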
3. Reward model: Hate-speech classifier
- Loaded Facebook’s RoBERTa hate-speech model as a binary classifier:
  - Input: `prompt + generated summary`, concatenated.
  - Output logits → probability of `not_hate` vs `hate`.
- Carefully fixed the index mapping: `not_hate_index = 0`, `hate_index = 1`.
- Constructed the reward signal from the `not_hate` logit:
  - Higher `not_hate` logit → higher reward (see the sketch below).
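A sketch of the reward computation, assuming the classifier is the `facebook/roberta-hate-speech-dynabench-r4-target` checkpoint and that the raw `not_hate` logit is used directly as the reward:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"  # assumed checkpoint
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

# Index mapping matters: index 0 is "not hate", index 1 is "hate".
NOT_HATE_INDEX = 0

def compute_reward(prompt: str, summary: str) -> float:
    """Return the raw `not_hate` logit for the concatenated prompt + summary."""
    text = prompt + summary
    inputs = toxicity_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = toxicity_model(**inputs).logits  # shape: (1, 2)
    # Higher `not_hate` logit -> higher reward for the PPO step.
    return logits[0, NOT_HATE_INDEX].item()
```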
4. Reference model & KL divergence
- Created a reference model (frozen) using TRL:
  - Same weights as the original instruction-tuned FLAN-T5.
  - Never updated during PPO.
- PPO loss includes a KL-divergence term:
  - Encourages the updated policy to stay close to the reference.
  - Prevents reward hacking (e.g., weird text that is non-toxic but irrelevant).
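In the TRL version used here, the frozen reference copy can be created with `create_reference_model`; a minimal sketch, building on the `ppo_model` from step 1:

```python
from trl import create_reference_model

# Frozen copy of the policy: never updated during PPO.
# The PPO loss adds a KL penalty between the trained policy and this reference,
# which keeps summaries close to the original model and discourages reward hacking.
ref_model = create_reference_model(ppo_model)
```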
5. PPO training loop (RLHF)
For each batch:
- Use the policy model to generate summaries for dialogue prompts.
- Concatenate `prompt + summary` and score it with the hate-speech reward model.
- Extract the `not_hate` logit as the reward.
- Feed `(query, response, reward)` into `PPOTrainer`.
- PPO updates only the LoRA adapter weights (≈1.4% of total parameters).
Training configuration highlights:
- Small number of PPO epochs (for lab runtime).
- `batch_size = 16`
- Monitored:
  - KL divergence (should remain bounded)
  - Mean reward / advantage (should increase)
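A condensed sketch of the loop, assuming the `PPOConfig`/`PPOTrainer` API from the TRL version used in the lab and reusing the hypothetical `compute_reward`, `ppo_model`, `ref_model`, and `dataset` objects from the earlier sketches; generation settings and learning rate are illustrative:

```python
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer
from trl.core import LengthSampler

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

config = PPOConfig(
    model_name="google/flan-t5-base",
    learning_rate=1.41e-5,  # illustrative value
    batch_size=16,
    mini_batch_size=4,
)

ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,      # policy with value head; only LoRA adapters are trainable
    ref_model=ref_model,  # frozen reference used for the KL penalty
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=lambda data: {key: [d[key] for d in data] for key in data[0]},
)

output_length_sampler = LengthSampler(100, 200)  # illustrative summary length range
generation_kwargs = {"min_length": 5, "top_k": 0.0, "top_p": 1.0, "do_sample": True}

for step, batch in enumerate(ppo_trainer.dataloader):
    query_tensors = batch["input_ids"]

    # 1. Generate a summary for each dialogue prompt with the current policy.
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

    # 2. Score prompt + summary with the hate-speech classifier (not_hate logit).
    rewards = [
        torch.tensor(compute_reward(q, r))
        for q, r in zip(batch["query"], batch["response"])
    ]

    # 3. PPO step: updates only the LoRA adapter weights, constrained by the KL term.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```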
Evaluation
Toxicity evaluation
- Used the `evaluate` library's toxicity metric.
- Built a helper function (see the sketch below) to:
  - Sample multiple dialogues,
  - Generate summaries,
  - Compute mean toxicity and standard deviation.
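A sketch of the toxicity helper, assuming the `evaluate` toxicity measurement backed by the same hate-speech checkpoint and reusing objects from the earlier sketches:

```python
import evaluate
import numpy as np
import torch

# The toxicity measurement can be backed by the same hate-speech checkpoint (assumed id).
toxicity_evaluator = evaluate.load(
    "toxicity",
    "facebook/roberta-hate-speech-dynabench-r4-target",
    module_type="measurement",
)

def evaluate_toxicity(model, tokenizer, dataset, num_samples=10):
    """Generate summaries for a handful of dialogues and return mean/std toxicity."""
    scores = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        input_ids = sample["input_ids"].unsqueeze(0)
        with torch.no_grad():
            output = model.generate(input_ids=input_ids, max_new_tokens=100,
                                    do_sample=True, top_p=1.0)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        result = toxicity_evaluator.compute(predictions=[sample["query"] + summary])
        scores.extend(result["toxicity"])
    return np.mean(scores), np.std(scores)

# Before/after comparison (the frozen reference approximates the pre-PPO policy).
mean_before, std_before = evaluate_toxicity(ref_model, tokenizer, dataset)
mean_after, std_after = evaluate_toxicity(ppo_model, tokenizer, dataset)
```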
Compared:
- Before PPO (baseline FLAN-T5):
  - Baseline mean toxicity.
- After PPO (RLHF-tuned FLAN-T5):
  - Mean toxicity decreased.
  - KL divergence remained within reasonable bounds.
Qualitative comparison
- Displayed side-by-side examples (see the sketch after this list):
  - Original prompt (dialogue + instruction).
  - Before-RLHF summary.
  - After-RLHF summary.
  - Reward scores for each.
- Observed that:
  - After RLHF, summaries were less toxic / less aggressive in tone.
  - Content remained on topic and coherent, thanks to KL regularization.
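One way to assemble such a side-by-side table, sketched with pandas and the hypothetical helpers and objects from the earlier snippets:

```python
import pandas as pd
import torch

def generate_summary(model, sample, max_new_tokens=100):
    """Hypothetical helper: decode one summary from a tokenized dialogue sample."""
    with torch.no_grad():
        output = model.generate(
            input_ids=sample["input_ids"].unsqueeze(0),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=1.0,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

rows = []
for i, sample in enumerate(dataset):
    if i >= 10:  # a small number of examples is enough for eyeballing
        break
    query = sample["query"]
    before = generate_summary(ref_model, sample)  # pre-RLHF behaviour (frozen reference)
    after = generate_summary(ppo_model, sample)   # post-RLHF policy
    rows.append({
        "query": query,
        "response_before": before,
        "response_after": after,
        "reward_before": compute_reward(query, before),
        "reward_after": compute_reward(query, after),
    })

# Sort by reward improvement so the clearest before/after examples surface first.
df = pd.DataFrame(rows)
df["reward_diff"] = df["reward_after"] - df["reward_before"]
df = df.sort_values("reward_diff", ascending=False)
```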
What I Learned
- How to implement a full RLHF loop using:
  - Policy model (FLAN-T5),
  - Reward model (hate-speech classifier),
  - Reference model + KL divergence,
  - `PPOTrainer` from TRL.
- The importance of label/index mapping in reward models (getting the `not_hate` index right).
- How PEFT/LoRA allows RLHF customization while only training ~1–2% of parameters.
- Practical trade-offs between toxicity reduction and fidelity to the original model.
- How to combine transformers, TRL, and evaluate into a reproducible RLHF experiment.
