Flan-T5 Summarization Fine-Tuning with PEFT (LoRA) on AWS SageMaker

Overview

This project explores how to adapt a general-purpose LLM to a domain-specific summarization task using two strategies:

  1. Full instruction fine-tuning of Flan-T5.
  2. Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters.

The goal is to improve summarization quality on a dialogue dataset while understanding the trade-offs between accuracy, model size, and compute cost.


Tech Stack

  • Model: Flan-T5 (Seq2Seq LLM)
  • Frameworks: Hugging Face Transformers, PEFT, PyTorch, torchdata
  • Evaluation: evaluate library with ROUGE (rouge1, rouge2, rougeL, rougeLsum)
  • Environment: AWS SageMaker (ml.m5.2xlarge – 8 vCPUs, 32 GB RAM)
  • Language: Python

Task & Dataset

  • Task: Summarize multi-turn conversations into short, human-readable summaries.
  • Baselines:
    • Human reference summaries (gold).
    • Zero-shot Flan-T5 with an instruction prompt:
      "Summarize the following conversation:\n\n<dialogue>\n\nSummary:"
  • Data split (for the lab run):
    • ~125 examples for training
    • 5 examples for validation
    • 15 examples for test/holdout
  • A larger offline run uses the full dataset to obtain more stable ROUGE metrics.

Each example is converted into an instruction-style prompt so the model learns to follow the summarization instruction consistently.
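
For context, here is a minimal sketch of the prompt conversion and the zero-shot baseline. The exact Flan-T5 variant and generation settings are assumptions, not lab-specific values:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Base (not yet fine-tuned) Flan-T5; the exact size is an assumption.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def build_prompt(dialogue: str) -> str:
    # Instruction-style prompt used for both the zero-shot baseline
    # and the fine-tuning examples.
    return f"Summarize the following conversation:\n\n{dialogue}\n\nSummary:"

def zero_shot_summary(dialogue: str) -> str:
    inputs = tokenizer(build_prompt(dialogue), return_tensors="pt", truncation=True)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```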


Part 1 – Full Instruction Fine-Tuning

Setup

  • Loaded base Flan-T5 using AutoModelForSeq2SeqLM and its tokenizer.
  • Wrapped each conversation in the instruction prompt ("Summarize the following conversation:\n\n<dialogue>\n\nSummary:").
  • Used Hugging Face’s Trainer and TrainingArguments for full fine-tuning, as sketched below.
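
A hedged sketch of the full fine-tuning setup, reusing build_prompt, model, and tokenizer from the snippet above. Hyperparameter values are illustrative, and train_dataset is assumed to be a Hugging Face Dataset with dialogue and summary fields:

```python
from transformers import Trainer, TrainingArguments

def tokenize_fn(example):
    # Inputs are the instruction prompts; labels are the gold summaries.
    model_inputs = tokenizer(
        build_prompt(example["dialogue"]),
        max_length=512, truncation=True, padding="max_length",
    )
    labels = tokenizer(
        example["summary"],
        max_length=128, truncation=True, padding="max_length",
    )
    # (For stricter loss masking, pad label ids would be replaced with -100.)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_dataset.map(
    tokenize_fn, remove_columns=train_dataset.column_names
)

training_args = TrainingArguments(
    output_dir="./flan-t5-dialogue-summary",  # hypothetical path
    learning_rate=1e-5,             # tuned for stability on the small split
    num_train_epochs=1,
    max_steps=100,                  # caps lab compute; overrides epochs
    per_device_train_batch_size=8,
    weight_decay=0.01,
    logging_steps=10,
)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train)
trainer.train()
```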

Key training details:

  • Learning rate tuned for stability on the small dataset.
  • Small number of epochs and a max_steps cap to keep lab compute low.
  • Metrics: ROUGE scores computed via the evaluate library.
  • Trainable parameters: ~250M, i.e. every weight updated during full fine-tuning (see the helper below).
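
The parameter count comes from a small generic helper along these lines:

```python
def print_trainable_parameters(model) -> None:
    # Compare trainable vs. total parameters; with full fine-tuning
    # every weight is trainable, so the two numbers match (~250M here).
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}%)")

print_trainable_parameters(model)
```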

For better performance, an extended version of the model was trained offline (more epochs, more steps) and loaded from S3 as a checkpoint:

  • Fully fine-tuned checkpoint size: ~945 MB.
  • Used as the main instruction-tuned model for evaluation.
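
Loading the offline checkpoint is ordinary from_pretrained usage once the files are copied down from S3; the bucket name and paths below are hypothetical:

```python
from transformers import AutoModelForSeq2SeqLM

# Copied beforehand, e.g. from the SageMaker notebook terminal:
#   aws s3 cp --recursive s3://<your-bucket>/flan-t5-instruct/ ./instruct-checkpoint/
# The local directory (~945 MB) then loads like any Hugging Face checkpoint.
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./instruct-checkpoint")
```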

Qualitative Results

On sample dialogues:

  • Zero-shot Flan-T5 often repeated phrases or missed important details.
  • Instruction-tuned Flan-T5:
    • Captured the conversations more accurately.
    • Included key entities and actions (e.g., “Person 1 suggests upgrading hardware and CD-ROM; Person 2 agrees”).

Quantitative Results (ROUGE)

On the small test set and on a larger offline evaluation:

  • The instruction-tuned model significantly outperformed the base Flan-T5.
  • Example relative improvements vs. the original model:
    • ROUGE-1: ~+18%
    • ROUGE-2: ~+10%
    • ROUGE-L: ~+13%
    • ROUGE-Lsum: ~+13.7%

This demonstrates how even modest fine-tuning on a focused dataset can materially improve summarization quality.
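
For reference, a minimal sketch of the ROUGE computation with the evaluate library, assuming predictions and references are lists of generated and gold summaries for the test split:

```python
import evaluate

rouge = evaluate.load("rouge")

scores = rouge.compute(
    predictions=predictions,   # model-generated summaries
    references=references,     # human reference ("gold") summaries
    use_stemmer=True,
)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```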