Flan-T5 Summarization Fine-Tuning with PEFT (LoRA) on AWS SageMaker

Overview

This project explores how to adapt a general-purpose LLM to a domain-specific summarization task using two strategies:

  1. Full instruction fine-tuning of Flan-T5.
  2. Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters.

The goal is to improve summarization quality on a dialogue dataset while understanding the trade-offs between accuracy, model size, and compute cost.


Tech Stack

  • Model: Flan-T5 (Seq2Seq LLM)
  • Frameworks: Hugging Face Transformers, PEFT, PyTorch, torchdata
  • Evaluation: evaluate library with ROUGE (rouge1, rouge2, rougeL, rougeLsum)
  • Environment: AWS SageMaker (ml.m5.2xlarge – 8 vCPUs, 32 GB RAM)
  • Language: Python

Task & Dataset

  • Task: Summarize multi-turn conversations into short, human-readable summaries.
  • Baselines:
    • Human reference summaries (gold).
    • Zero-shot Flan-T5 with an instruction prompt:
      "Summarize the following conversation:\n\n<dialogue>\n\nSummary:"
  • Data split (for the lab run):
    • ~125 examples for training
    • 5 examples for validation
    • 15 examples for test/holdout
  • A larger offline run uses the full dataset to obtain more stable ROUGE metrics.

Each example is converted into an instruction-style prompt so the model learns to follow the summarization instruction consistently.
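
For context, here is a minimal sketch of the prompt conversion and the zero-shot baseline. The exact Flan-T5 variant and generation settings are assumptions, not lab-specific values:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Base (not yet fine-tuned) Flan-T5; the exact size is an assumption.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def build_prompt(dialogue: str) -> str:
    # Instruction-style prompt used for both the zero-shot baseline
    # and the fine-tuning examples.
    return f"Summarize the following conversation:\n\n{dialogue}\n\nSummary:"

def zero_shot_summary(dialogue: str) -> str:
    inputs = tokenizer(build_prompt(dialogue), return_tensors="pt", truncation=True)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```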


Part 1 – Full Instruction Fine-Tuning

Setup

  • Loaded base Flan-T5 using AutoModelForSeq2SeqLM and its tokenizer.
  • Wrapped each conversation in the instruction prompt ("Summarize the following conversation:\n\n<dialogue>\n\nSummary:").
  • Used Hugging Face’s Trainer and TrainingArguments for full fine-tuning, as sketched below.
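
A hedged sketch of the full fine-tuning setup, reusing build_prompt, model, and tokenizer from the snippet above. Hyperparameter values are illustrative, and train_dataset is assumed to be a Hugging Face Dataset with dialogue and summary fields:

```python
from transformers import Trainer, TrainingArguments

def tokenize_fn(example):
    # Inputs are the instruction prompts; labels are the gold summaries.
    model_inputs = tokenizer(
        build_prompt(example["dialogue"]),
        max_length=512, truncation=True, padding="max_length",
    )
    labels = tokenizer(
        example["summary"],
        max_length=128, truncation=True, padding="max_length",
    )
    # (For stricter loss masking, pad label ids would be replaced with -100.)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_dataset.map(
    tokenize_fn, remove_columns=train_dataset.column_names
)

training_args = TrainingArguments(
    output_dir="./flan-t5-dialogue-summary",  # hypothetical path
    learning_rate=1e-5,             # tuned for stability on the small split
    num_train_epochs=1,
    max_steps=100,                  # caps lab compute; overrides epochs
    per_device_train_batch_size=8,
    weight_decay=0.01,
    logging_steps=10,
)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train)
trainer.train()
```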

Key training details:

  • Learning rate tuned for stability on the small dataset.
  • Small number of epochs and a max_steps cap to keep lab compute low.
  • Metrics: ROUGE scores computed via the evaluate library.
  • Trainable parameters: ~250M, i.e. every weight updated during full fine-tuning (see the helper below).
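
The parameter count comes from a small generic helper along these lines:

```python
def print_trainable_parameters(model) -> None:
    # Compare trainable vs. total parameters; with full fine-tuning
    # every weight is trainable, so the two numbers match (~250M here).
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}%)")

print_trainable_parameters(model)
```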

For better performance, an extended version of the model was trained offline (more epochs, more steps) and loaded from S3 as a checkpoint:

  • Fully fine-tuned checkpoint size: ~945 MB.
  • Used as the main instruction-tuned model for evaluation.
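
Loading the offline checkpoint is ordinary from_pretrained usage once the files are copied down from S3; the bucket name and paths below are hypothetical:

```python
from transformers import AutoModelForSeq2SeqLM

# Copied beforehand, e.g. from the SageMaker notebook terminal:
#   aws s3 cp --recursive s3://<your-bucket>/flan-t5-instruct/ ./instruct-checkpoint/
# The local directory (~945 MB) then loads like any Hugging Face checkpoint.
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./instruct-checkpoint")
```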

Qualitative Results

On sample dialogues:

  • Zero-shot Flan-T5 often repeated phrases or missed important details.
  • Instruction-tuned Flan-T5:
    • Captured the conversations more accurately.
    • Included key entities and actions (e.g., “Person 1 suggests upgrading hardware and CD-ROM; Person 2 agrees”).

Quantitative Results (ROUGE)

On the small test set and on a larger offline evaluation:

  • The instruction-tuned model significantly outperformed the base Flan-T5.
  • Example relative improvements vs. the original model:
    • ROUGE-1: ~+18%
    • ROUGE-2: ~+10%
    • ROUGE-L: ~+13%
    • ROUGE-Lsum: ~+13.7%

This demonstrates how even modest fine-tuning on a focused dataset can materially improve summarization quality.
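
For reference, a minimal sketch of the ROUGE computation with the evaluate library, assuming predictions and references are lists of generated and gold summaries for the test split:

```python
import evaluate

rouge = evaluate.load("rouge")

scores = rouge.compute(
    predictions=predictions,   # model-generated summaries
    references=references,     # human reference ("gold") summaries
    use_stemmer=True,
)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```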