Overview
This project explores how to adapt a general-purpose LLM to a domain-specific summarization task using two strategies:
- Full instruction fine-tuning of Flan-T5.
- Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters.
The goal is to improve summarization quality on a dialogue dataset while understanding the trade-offs between accuracy, model size, and compute cost.
Tech Stack
- Model: Flan-T5 (Seq2Seq LLM)
- Frameworks: Hugging Face Transformers, PEFT, PyTorch, torchdata
- Evaluation: `evaluate` library with ROUGE (rouge1, rouge2, rougeL, rougeLsum)
- Environment: AWS SageMaker (`ml.m5.2xlarge`, 8 vCPUs, 32 GB RAM)
- Language: Python
Task & Dataset
- Task: Summarize multi-turn conversations into short, human-readable summaries.
- Baselines:
  - Human reference summaries (gold).
  - Zero-shot Flan-T5 with an instruction prompt (see the sketch after this list):
    "Summarize the following conversation:\n\n<dialogue>\n\nSummary:"
- Data split (for the lab run):
  - ~125 examples for training
  - 5 examples for validation
  - 15 examples for test/holdout
- A larger offline run uses the full dataset to obtain more stable ROUGE metrics.
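To make the zero-shot baseline concrete, here is a minimal sketch of prompting the base model with the instruction template. The checkpoint name and sample dialogue are illustrative, not the exact ones from the lab:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Checkpoint name is an assumption -- substitute the Flan-T5 variant used in the lab.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Illustrative dialogue; the lab draws these from the dataset's test split.
dialogue = "#Person1#: My computer is so slow.\n#Person2#: Maybe upgrade the hardware?"
prompt = f"Summarize the following conversation:\n\n{dialogue}\n\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```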
Each example is converted into an instruction-style prompt so the model learns to follow the summarization instruction consistently.
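A sketch of that conversion, assuming the data is a Hugging Face `datasets.DatasetDict` with `dialogue` and `summary` columns (both column names and the `max_length` values are assumptions):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  # checkpoint assumed

def tokenize_function(example):
    # Wrap the dialogue in the same instruction template used at inference time.
    prompt = (
        "Summarize the following conversation:\n\n"
        f"{example['dialogue']}\n\nSummary:"
    )
    model_inputs = tokenizer(prompt, max_length=512, truncation=True)
    # The tokenized reference summary becomes the training target.
    labels = tokenizer(example["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# `dataset` is the DatasetDict described above.
tokenized = dataset.map(tokenize_function, remove_columns=["dialogue", "summary"])
```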
Part 1 – Full Instruction Fine-Tuning
Setup
- Loaded base Flan-T5 using `AutoModelForSeq2SeqLM` and its tokenizer.
- Wrapped conversations into instruction prompts:
  "Summarize the following conversation:\n\n<dialogue>\n\nSummary:"
- Used Hugging Face's `Trainer` and `TrainingArguments` for full fine-tuning (a condensed sketch follows the training details below).
Key training details:
- Learning rate tuned for stability on the small dataset.
- Small number of epochs and `max_steps` to keep lab compute low.
- Metrics: ROUGE scores computed via the `evaluate` library.
- Counted trainable parameters: ~250M parameters updated during full fine-tuning.
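A condensed sketch of the full fine-tuning setup and the parameter count, continuing from the sketches above. All hyperparameter values here are illustrative placeholders, not the tuned values from the lab:

```python
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

# With full fine-tuning, every weight is trainable (~250M for this model size).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")

training_args = TrainingArguments(
    output_dir="./flan-t5-dialogue-summary",  # hypothetical output path
    learning_rate=1e-5,           # illustrative; tuned for stability in the lab
    num_train_epochs=1,           # kept small for the lab run
    max_steps=100,                # illustrative cap; overrides epochs when set
    per_device_train_batch_size=8,
    weight_decay=0.01,
    logging_steps=10,
)

trainer = Trainer(
    model=model,                          # base Flan-T5 from the earlier sketch
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads per batch
)
trainer.train()
```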
For better performance, an extended version of the model was trained offline (more epochs, more steps) and loaded from S3 as a checkpoint:
- Fully fine-tuned checkpoint size: ~945 MB.
- Used this as the main instruction-tuned model for evaluation.
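Loading that offline checkpoint might look like the following; the S3 URI is a placeholder, not the actual bucket, and the AWS CLI copy is one of several ways to pull it down:

```python
import subprocess
from transformers import AutoModelForSeq2SeqLM

# Copy the checkpoint directory down from S3 (placeholder URI), then load it.
subprocess.run(
    ["aws", "s3", "cp", "s3://<your-bucket>/flan-t5-dialogue-summary-checkpoint/",
     "./checkpoint/", "--recursive"],
    check=True,
)
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./checkpoint/")
```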
Qualitative Results
On sample dialogues:
- Zero-shot Flan-T5 often repeated phrases or missed important details.
- Instruction-tuned Flan-T5:
  - Captured conversations more accurately.
  - Included key entities and actions (e.g., “Person 1 suggests upgrading hardware and CD-ROM; Person 2 agrees”).
Quantitative Results (ROUGE)
On the small test set and on a larger offline evaluation:
- Instruction-tuned model significantly outperformed the base Flan-T5.
- Example relative improvements vs. the original model:
  - ROUGE-1: ~+18%
  - ROUGE-2: ~+10%
  - ROUGE-L: ~+13%
  - ROUGE-Lsum: ~+13.7%
This demonstrates how even modest fine-tuning on a focused dataset can materially improve summarization quality.
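A sketch of how this comparison can be computed with the `evaluate` library; the summary lists are illustrative names for outputs generated as above:

```python
import evaluate  # the ROUGE metric also requires the rouge_score package

rouge = evaluate.load("rouge")

# `baseline_summaries`, `instruct_summaries`, and `human_summaries` are lists of
# strings (model outputs and gold references); the names are illustrative.
original_results = rouge.compute(
    predictions=baseline_summaries, references=human_summaries, use_stemmer=True
)
instruct_results = rouge.compute(
    predictions=instruct_summaries, references=human_summaries, use_stemmer=True
)

# Relative improvement of the instruction-tuned model over the base model.
for key in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
    delta = instruct_results[key] - original_results[key]
    print(f"{key}: {delta / original_results[key]:+.1%}")
```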
