Dialogue Summarization with FLAN-T5 & Prompt Engineering

Overview

This project explores conversation summarization with large language models (LLMs).

Using Hugging Face’s FLAN-T5 model, I built an experiment that takes multi-turn dialogues (e.g., support chats between a customer and an agent) and generates concise summaries. The focus of the project is not just on using a pretrained model, but on:

  • Understanding how tokenization and text generation work.
  • Exploring zero-shot, one-shot, and few-shot prompt engineering.
  • Tuning generation parameters (temperature, sampling) to control model behavior.

A typical real-world scenario would be: “At the end of the month, summarize all customer support conversations into key issues and resolutions.”


Tech Stack

  • Language & Environment

    • Python 3
    • Jupyter Notebook (interactive experimentation)
  • Core Libraries

    • PyTorch – deep learning backend for FLAN-T5
    • torchdata – helpers for dataset loading
    • Hugging Face Transformers – FLAN-T5 model, tokenizer, generation
    • Hugging Face Datasets – for loading the public DialogueSum dataset
  • Compute

    • CPU-based environment (8 vCPUs, 32 GB RAM) suitable for experimentation and small-scale inference.

Dataset: DialogueSum

I used the DialogueSum dataset, loaded via datasets.load_dataset. It contains:

  • Multi-turn dialogues between Person 1 and Person 2.
  • A corresponding human-written summary for each conversation.

Each training example looks like:

  • dialogue: the full conversation text.
  • summary: a human baseline summary.

This gives a clear target to compare model-generated summaries against.
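
A minimal loading sketch (the Hugging Face dataset ID knkarthick/dialogsum is an assumption; substitute whichever copy of the dataset you use):

    from datasets import load_dataset

    # Dataset ID assumed for illustration; adjust if your copy of DialogueSum differs.
    dataset = load_dataset("knkarthick/dialogsum")

    # Each record pairs a multi-turn dialogue with a human-written reference summary.
    example = dataset["train"][0]
    print(example["dialogue"])
    print(example["summary"])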


Model & Tokenization

FLAN-T5

  • Loaded a FLAN-T5 checkpoint from Hugging Face.
  • FLAN-T5 is a general-purpose sequence-to-sequence model that supports tasks like:
    • Summarization
    • Translation
    • Question answering
    • Instruction following

Tokenizer

  • Used the matching T5 tokenizer to:
    • Convert raw text to token IDs (integers).
    • Convert token IDs back to human-readable text.
  • Verified the round-trip: encoding “What time is it, Tom?” to IDs and decoding back to the original sentence.
  • Under the hood, these IDs index into embeddings, which are used during forward passes and backprop.
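
A sketch of the setup and the round-trip check (the flan-t5-base checkpoint name is an assumption; the experiment works the same way with other FLAN-T5 sizes):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "google/flan-t5-base"  # checkpoint size assumed for illustration
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Encode a sentence to token IDs, then decode back to verify the round trip.
    sentence = "What time is it, Tom?"
    token_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    print(token_ids[0])                                              # integer token IDs
    print(tokenizer.decode(token_ids[0], skip_special_tokens=True))  # original sentence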

Experiments

1. Baseline Zero-Shot Summarization

First, I passed dialogues directly into FLAN-T5 without any explicit instruction:

  • Input: the raw dialogue only.
  • Output: model-generated summary.

Results:

  • The model picked up small details (e.g., “It’s 10 to 9”) but missed most of the context.
  • Summaries were often too short or incomplete.

This gave me a baseline of what the raw model can do with no guidance.
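
A rough sketch of this baseline pass, reusing the model, tokenizer, and dataset loaded above (variable names and the test-set index are illustrative):

    # Baseline: feed the raw dialogue with no instruction and decode the output.
    dialogue = dataset["test"][0]["dialogue"]

    input_ids = tokenizer(dialogue, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))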


2. Zero-Shot with Instructions (Prompt Engineering)

Next, I added explicit instructions to the input, trying two prompt templates:

    Summarize the following conversation:

    {dialogue}

    Summary:

and

    Dialogue:

    {dialogue}

    What was going on in the above dialogue?

Findings:

  • Instruction prompts improved the summaries slightly (the model understood it should generate a summary).
  • However, the output was still often shallow or missed important details.

This step illustrated the impact of instruction tuning and prompt design even before any fine-tuning.
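
A sketch of the instruction-prompt variant, continuing from the setup above and using the first template shown earlier:

    # Wrap the dialogue in an explicit instruction before generating.
    dialogue = dataset["test"][0]["dialogue"]

    prompt = f"Summarize the following conversation:\n\n{dialogue}\n\nSummary:"

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))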

3. One-Shot In-Context Learning

Then I tried one-shot prompting:

  • Provide one full example: a dialogue plus its correct human summary.
  • Then provide a second dialogue and ask the model to produce the summary.

Structure (simplified):

    Example 1:
    Dialogue: {dialogue_1}
    Summary: {human_summary_1}

    Example 2:
    Dialogue: {dialogue_2}
    Summary:

Findings:

  • The model began to mimic the style and level of detail of the human summary.
  • Summaries were richer and closer to the desired output.
  • Demonstrated the power of in-context learning for LLMs, even without changing model weights.
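
A sketch of how the one-shot prompt can be assembled, continuing from the earlier setup (the test-set indices are illustrative):

    # One solved example (dialogue + human summary), then the dialogue to summarize.
    example = dataset["test"][0]
    target = dataset["test"][1]

    prompt = (
        f"Dialogue:\n\n{example['dialogue']}\n\n"
        f"Summary: {example['summary']}\n\n"
        f"Dialogue:\n\n{target['dialogue']}\n\n"
        f"Summary:"
    )

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))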

4. Few-Shot In-Context Learning

I extended this idea to few-shot prompting:

  • 3 examples: (dialogue, human summary) pairs.
  • 1 new dialogue without its summary.

Structure (simplified):

    Example 1: dialogue + summary
    Example 2: dialogue + summary
    Example 3: dialogue + summary

    Dialogue: {dialogue_4}
    Summary:

Findings:

  • For some cases, few-shot did not significantly beat one-shot.
  • Important practical lesson: performance doesn’t always scale linearly with “more shots”; after ~3–5 examples, gains often saturate.
  • Also encountered context length limits (~512 tokens): longer dialogues can hit the model’s maximum sequence length, causing warnings or truncation.
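
A sketch of the few-shot variant, continuing from the earlier setup (the example indices are illustrative):

    # Three solved (dialogue, summary) examples, then the new dialogue.
    prompt = ""
    for i in (0, 1, 2):
        shot = dataset["test"][i]
        prompt += f"Dialogue:\n\n{shot['dialogue']}\n\nSummary: {shot['summary']}\n\n"

    target = dataset["test"][3]
    prompt += f"Dialogue:\n\n{target['dialogue']}\n\nSummary:"

    # Long prompts can exceed the ~512-token context and get truncated.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    print(input_ids.shape[1], "prompt tokens")

    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))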

5. Generation Configuration (Temperature & Sampling)

In the last part, I played with generation parameters:

  • Temperature
    • Low temperature (~0.1): conservative, repetitive, more deterministic summaries.
    • High temperature (~1.0–2.0): more diverse and creative, sometimes too “wild” for summarization.
  • Sampling strategies
    • Controlled how diverse the model output can be.
    • For a production summarization system, you’d typically use a lower temperature for reliability.

This experimentation helped build intuition about how decoding strategies affect the final text quality.
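
A sketch of the decoding comparison, continuing from the earlier setup (the two temperature values mirror the ones discussed; other settings are illustrative):

    from transformers import GenerationConfig

    # Compare a conservative decode with a more exploratory one on the same prompt.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

    for temperature in (0.1, 1.0):
        generation_config = GenerationConfig(
            max_new_tokens=50,
            do_sample=True,        # sampling must be on for temperature to matter
            temperature=temperature,
        )
        output_ids = model.generate(input_ids, generation_config=generation_config)
        summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        print(f"temperature={temperature}: {summary}")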

What I Learned

  • How to integrate Hugging Face Transformers and Datasets with PyTorch in a practical NLP workflow.
  • The importance of prompt engineering:
    • Zero-shot vs. one-shot vs. few-shot.
    • Writing clear, task-specific instructions.
  • How in-context learning can dramatically improve model performance without any weight updates.
  • Trade-offs between deterministic and creative outputs using temperature and sampling methods.
  • Practical considerations like context length limits and the diminishing returns of adding more examples to a prompt.

This project is a strong foundation for future work such as:

  • Fine-tuning FLAN-T5 (e.g., with LoRA/PEFT).
  • Building an automated summarization service for real customer support logs.