Dialogue Summarization with FLAN-T5 & Prompt Engineering

Overview

This project explores conversation summarization with large language models (LLMs).

Using Hugging Face’s FLAN-T5 model, I built an experiment that takes multi-turn dialogues (e.g., support chats between a customer and an agent) and generates concise summaries. The focus of the project is not just on using a pretrained model, but on:

  • Understanding how tokenization and text generation work.
  • Exploring zero-shot, one-shot, and few-shot prompt engineering.
  • Tuning generation parameters (temperature, sampling) to control model behavior.

A typical real-world scenario would be: “At the end of the month, summarize all customer support conversations into key issues and resolutions.”


Tech Stack

  • Language & Environment

    • Python 3
    • Jupyter Notebook (interactive experimentation)
  • Core Libraries

    • PyTorch – deep learning backend for FLAN-T5
    • torchdata – helpers for dataset loading
    • Hugging Face Transformers – FLAN-T5 model, tokenizer, generation
    • Hugging Face Datasets – for loading the public DialogueSum dataset
  • Compute

    • CPU-based environment (8 vCPUs, 32 GB RAM) suitable for experimentation and small-scale inference.

Dataset: DialogueSum

I used the DialogueSum dataset, loaded via datasets.load_dataset. It contains:

  • Multi-turn dialogues between Person 1 and Person 2.
  • A corresponding human-written summary for each conversation.

Each training example looks like:

  • dialogue: the full conversation text.
  • summary: a human baseline summary.

This gives a clear target to compare model-generated summaries against.
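
A minimal loading sketch (the Hugging Face dataset ID knkarthick/dialogsum is an assumption; substitute whichever copy of the dataset you use):

    from datasets import load_dataset

    # Dataset ID assumed for illustration; adjust if your copy of DialogueSum differs.
    dataset = load_dataset("knkarthick/dialogsum")

    # Each record pairs a multi-turn dialogue with a human-written reference summary.
    example = dataset["train"][0]
    print(example["dialogue"])
    print(example["summary"])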


Model & Tokenization

FLAN-T5

  • Loaded a FLAN-T5 checkpoint from Hugging Face.
  • FLAN-T5 is a general-purpose sequence-to-sequence model that supports tasks like:
    • Summarization
    • Translation
    • Question answering
    • Instruction following

Tokenizer

  • Used the matching T5 tokenizer to:
    • Convert raw text to token IDs (integers).
    • Convert token IDs back to human-readable text.
  • Verified the round-trip: encoding “What time is it, Tom?” to IDs and decoding back to the original sentence.
  • Under the hood, these IDs index into embeddings, which are used during forward passes and backprop.
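
A sketch of the setup and the round-trip check (the flan-t5-base checkpoint name is an assumption; the experiment works the same way with other FLAN-T5 sizes):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "google/flan-t5-base"  # checkpoint size assumed for illustration
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Encode a sentence to token IDs, then decode back to verify the round trip.
    sentence = "What time is it, Tom?"
    token_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    print(token_ids[0])                                              # integer token IDs
    print(tokenizer.decode(token_ids[0], skip_special_tokens=True))  # original sentence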

Experiments

1. Baseline Zero-Shot Summarization

First, I passed dialogues directly into FLAN-T5 without any explicit instruction:

  • Input: the raw dialogue only.
  • Output: model-generated summary.

Results:

  • The model picked up small details (e.g., “It’s 10 to 9”) but missed most of the context.
  • Summaries were often too short or incomplete.

This gave me a baseline of what the raw model can do with no guidance.
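
A rough sketch of this baseline pass, reusing the model, tokenizer, and dataset loaded above (variable names and the test-set index are illustrative):

    # Baseline: feed the raw dialogue with no instruction and decode the output.
    dialogue = dataset["test"][0]["dialogue"]

    input_ids = tokenizer(dialogue, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))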


2. Zero-Shot with Instructions (Prompt Engineering)

Next, I added explicit instructions to the input, trying two prompt templates:

    Summarize the following conversation:

    {dialogue}

    Summary:

and

    Dialogue:

    {dialogue}

    What was going on in the above dialogue?

Findings:

  • Instruction prompts improved the summaries slightly (the model understood it should generate a summary).
  • However, the output was still often shallow or missed important details.

This step illustrated the impact of instruction tuning and prompt design even before any fine-tuning.
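
A sketch of the instruction-prompt variant, continuing from the setup above and using the first template shown earlier:

    # Wrap the dialogue in an explicit instruction before generating.
    dialogue = dataset["test"][0]["dialogue"]

    prompt = f"Summarize the following conversation:\n\n{dialogue}\n\nSummary:"

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))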

3. One-Shot In-Context Learning

Then I tried one-shot prompting:

  • Provide one full example: a dialogue plus its correct human summary.
  • Then provide a second dialogue and ask the model to produce the summary.

Structure (simplified):

    Example 1:
    Dialogue: {dialogue_1}
    Summary: {human_summary_1}

    Example 2:
    Dialogue: {dialogue_2}
    Summary:

Findings:

  • The model began to mimic the style and level of detail of the human summary.
  • Summaries were richer and closer to the desired output.
  • Demonstrated the power of in-context learning for LLMs, even without changing model weights.
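
A sketch of how the one-shot prompt can be assembled, continuing from the earlier setup (the test-set indices are illustrative):

    # One solved example (dialogue + human summary), then the dialogue to summarize.
    example = dataset["test"][0]
    target = dataset["test"][1]

    prompt = (
        f"Dialogue:\n\n{example['dialogue']}\n\n"
        f"Summary: {example['summary']}\n\n"
        f"Dialogue:\n\n{target['dialogue']}\n\n"
        f"Summary:"
    )

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))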

4. Few-Shot In-Context Learning

I extended this idea to few-shot prompting:

  • 3 examples: (dialogue, human summary) pairs.
  • 1 new dialogue without its summary.

Structure (simplified):

    Example 1: dialogue + summary
    Example 2: dialogue + summary
    Example 3: dialogue + summary

    Dialogue: {dialogue_4}
    Summary:

Findings:

  • For some cases, few-shot did not significantly beat one-shot.
  • Important practical lesson: performance doesn’t always scale linearly with “more shots”; after ~3–5 examples, gains often saturate.
  • Also encountered context length limits (~512 tokens): longer dialogues can hit the model’s maximum sequence length, causing warnings or truncation.
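
A sketch of the few-shot variant, continuing from the earlier setup (the example indices are illustrative):

    # Three solved (dialogue, summary) examples, then the new dialogue.
    prompt = ""
    for i in (0, 1, 2):
        shot = dataset["test"][i]
        prompt += f"Dialogue:\n\n{shot['dialogue']}\n\nSummary: {shot['summary']}\n\n"

    target = dataset["test"][3]
    prompt += f"Dialogue:\n\n{target['dialogue']}\n\nSummary:"

    # Long prompts can exceed the ~512-token context and get truncated.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    print(input_ids.shape[1], "prompt tokens")

    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))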

5. Generation Configuration (Temperature & Sampling)

In the last part, I played with generation parameters:

  • Temperature
    • Low temperature (~0.1): conservative, repetitive, more deterministic summaries.
    • High temperature (~1.0–2.0): more diverse and creative, sometimes too “wild” for summarization.
  • Sampling strategies
    • Controlled how diverse the model output can be.
    • For a production summarization system, you’d typically use a lower temperature for reliability.

This experimentation helped build intuition about how decoding strategies affect the final text quality.
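
A sketch of the decoding comparison, continuing from the earlier setup (the two temperature values mirror the ones discussed; other settings are illustrative):

    from transformers import GenerationConfig

    # Compare a conservative decode with a more exploratory one on the same prompt.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

    for temperature in (0.1, 1.0):
        generation_config = GenerationConfig(
            max_new_tokens=50,
            do_sample=True,        # sampling must be on for temperature to matter
            temperature=temperature,
        )
        output_ids = model.generate(input_ids, generation_config=generation_config)
        summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        print(f"temperature={temperature}: {summary}")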

What I Learned

  • How to integrate Hugging Face Transformers and Datasets with PyTorch in a practical NLP workflow.
  • The importance of prompt engineering:
    • Zero-shot vs. one-shot vs. few-shot.
    • Writing clear, task-specific instructions.
  • How in-context learning can dramatically improve model performance without any weight updates.
  • Trade-offs between deterministic and creative outputs using temperature and sampling methods.
  • Practical considerations like context length limits and the diminishing returns of adding more examples to a prompt.

This project is a strong foundation for future work such as:

  • Fine-tuning FLAN-T5 (e.g., with LoRA/PEFT).
  • Building an automated summarization service for real customer support logs.