Overview
This project explores conversation summarization with large language models (LLMs).
Using Hugging Face’s FLAN-T5 model, I built an experiment that takes multi-turn dialogues (e.g., support chats between a customer and an agent) and generates concise summaries. The focus of the lab is not just on using a pretrained model, but on:
- Understanding how tokenization and text generation work.
- Exploring zero-shot, one-shot, and few-shot prompt engineering.
- Tuning generation parameters (temperature, sampling) to control model behavior.
A typical real-world scenario would be: “At the end of the month, summarize all customer support conversations into key issues and resolutions.”
Tech Stack
Language & Environment
- Python 3
- Jupyter Notebook (interactive experimentation)
Core Libraries
- PyTorch – deep learning backend for FLAN-T5
- torchdata – helpers for dataset loading
- Hugging Face Transformers – FLAN-T5 model, tokenizer, generation
- Hugging Face Datasets – for loading the public DialogueSum dataset
Compute
- CPU-based environment (8 vCPUs, 32 GB RAM) suitable for experimentation and small-scale inference.
Dataset: DialogueSum
I used the DialogueSum dataset via datasets.load_dataset, which contains:
- Multi-turn dialogues between Person 1 and Person 2.
- A corresponding human-written summary for each conversation.
Each training example looks like:
- dialogue: the full conversation text.
- summary: a human baseline summary.
This gives a clear target to compare model-generated summaries against.
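A minimal loading sketch. The Hub identifier knkarthick/dialogsum is an assumption about where this copy of the dataset lives; the field names match what the lab uses:

```python
from datasets import load_dataset

# Assumed Hub identifier for the DialogueSum data; adjust if your copy lives elsewhere.
dataset = load_dataset("knkarthick/dialogsum")

example = dataset["train"][0]
print(example["dialogue"])  # multi-turn conversation text
print(example["summary"])   # human-written baseline summary
```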
Model & Tokenization
FLAN-T5
- Loaded a FLAN-T5 checkpoint from Hugging Face (see the loading sketch after this list).
- FLAN-T5 is a general-purpose sequence-to-sequence model that supports tasks like:
- Summarization
- Translation
- Question answering
- Instruction following
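A minimal loading sketch; google/flan-t5-base is an assumed checkpoint size, and the smaller or larger variants load the same way:

```python
from transformers import AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"  # assumed checkpoint; flan-t5-small/-large work the same way
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```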
Tokenizer
- Used the matching T5 tokenizer to:
- Convert raw text to token IDs (integers).
- Convert token IDs back to human-readable text.
- Verified the round-trip: encoding “What time is it, Tom?” to IDs and decoding back to the original sentence.
- Under the hood, these IDs index into the model’s embedding table, whose vectors are used during forward passes and updated during backpropagation.
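A sketch of the round-trip check, assuming the same FLAN-T5 checkpoint as above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  # assumed checkpoint

sentence = "What time is it, Tom?"
token_ids = tokenizer(sentence, return_tensors="pt").input_ids         # text -> token IDs
round_trip = tokenizer.decode(token_ids[0], skip_special_tokens=True)  # token IDs -> text

print(token_ids)
print(round_trip)  # "What time is it, Tom?"
```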
Experiments
1. Baseline Zero-Shot Summarization
First, I passed dialogues directly into FLAN-T5 without any explicit instruction:
- Input: the raw dialogue only.
- Output: model-generated summary.
Results:
- The model picked up small details (e.g., “It’s 10 to 9”) but missed most of the context.
- Summaries were often too short or incomplete.
This gave me a baseline of what the raw model can do with no guidance.
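A sketch of the baseline call, reusing the model, tokenizer, and dataset from the earlier sketches (the test index is an arbitrary pick):

```python
# Zero-shot baseline: feed the raw dialogue with no instruction at all.
dialogue = dataset["test"][0]["dialogue"]  # arbitrary test example

inputs = tokenizer(dialogue, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```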
2. Zero-Shot with Instructions (Prompt Engineering)
Next, I added explicit instructions to the input, trying two templates (with {dialogue} standing in for the conversation text):
- Summarize the following conversation: {dialogue} Summary:
- Dialogue: {dialogue} What was going on in the above dialogue?
Findings:
- Instruction prompts improved the summaries slightly (the model understood it should generate a summary).
- However, the output was still often shallow or missed important details.
This step illustrated the impact of instruction tuning and prompt design even before any fine-tuning.
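A sketch of one instruction-wrapped prompt (the exact wording is illustrative, reusing the model and tokenizer from above):

```python
dialogue = dataset["test"][0]["dialogue"]  # arbitrary test example

# Wrap the dialogue in an explicit summarization instruction.
prompt = f"""Summarize the following conversation.

{dialogue}

Summary:"""

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```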
3. One-Shot In-Context Learning
Then I tried one-shot prompting:
- Provide one full example: a dialogue plus its correct human summary.
- Then provide a second dialogue and ask the model to produce the summary.
Structure (simplified):
Example 1: Dialogue: dialogue_1 Summary: human_summary_1
Example 2: Dialogue: dialogue_2 Summary:
Findings:
- The model began to mimic the style and level of detail of the human summary.
- Summaries were richer and closer to the desired output.
- Demonstrated the power of in-context learning for LLMs, even without changing model weights.
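A sketch of the one-shot prompt construction, assuming the dataset, model, and tokenizer loaded earlier (the example indices are arbitrary picks):

```python
def build_one_shot_prompt(example_dialogue, example_summary, new_dialogue):
    # One worked (dialogue, summary) pair, then the dialogue we actually want summarized.
    return (
        f"Dialogue:\n\n{example_dialogue}\n\nSummary: {example_summary}\n\n\n"
        f"Dialogue:\n\n{new_dialogue}\n\nSummary:"
    )

prompt = build_one_shot_prompt(
    dataset["test"][40]["dialogue"],   # arbitrary worked example
    dataset["test"][40]["summary"],
    dataset["test"][200]["dialogue"],  # arbitrary dialogue to summarize
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```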
4. Few-Shot In-Context Learning
I extended this idea to few-shot prompting:
- 3 examples: (dialogue, human summary) pairs.
- 1 new dialogue without its summary.
Structure (simplified):
Example 1: dialogue + summary
Example 2: dialogue + summary
Example 3: dialogue + summary
Dialogue: dialogue_4 Summary:
Findings:
- For some cases, few-shot did not significantly beat one-shot.
- Important practical lesson: performance doesn’t always scale linearly with “more shots”; after ~3–5 examples, gains often saturate.
- Also encountered context length limits (~512 tokens) – longer dialogues can hit the model’s maximum sequence length, causing warnings or truncation.
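A few-shot prompt builder along the same lines, again assuming the objects loaded earlier; truncating at ~512 tokens is a pragmatic guard against the context limit mentioned above:

```python
def build_few_shot_prompt(shots, new_dialogue):
    # shots: list of (dialogue, summary) pairs used as in-context examples.
    prompt = ""
    for shot_dialogue, shot_summary in shots:
        prompt += f"Dialogue:\n\n{shot_dialogue}\n\nSummary: {shot_summary}\n\n\n"
    return prompt + f"Dialogue:\n\n{new_dialogue}\n\nSummary:"

shots = [
    (dataset["test"][i]["dialogue"], dataset["test"][i]["summary"])
    for i in (40, 80, 120)  # arbitrary example indices
]
prompt = build_few_shot_prompt(shots, dataset["test"][200]["dialogue"])

# Long prompts can exceed FLAN-T5's ~512-token limit; truncating avoids hard failures.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```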
5. Generation Configuration (Temperature & Sampling)
In the last part, I played with generation parameters:
Temperature
- Low temperature (~0.1): conservative, repetitive, more deterministic summaries.
- High temperature (~1.0–2.0): more diverse and creative, sometimes too “wild” for summarization.
Sampling strategies
- Controlled how diverse the model output can be.
- For a production summarization system, you’d typically use a lower temperature for reliability.
This experimentation helped build intuition about how decoding strategies affect the final text quality.
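A sketch of sampled decoding with GenerationConfig, reusing the few-shot prompt and model from above; the temperature value is just one illustrative point on the spectrum described earlier:

```python
from transformers import GenerationConfig

# Sampled decoding: raising the temperature makes summaries more diverse and less predictable.
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.7)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(inputs["input_ids"], generation_config=generation_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```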
What I Learned
- How to integrate Hugging Face Transformers and Datasets with PyTorch in a practical NLP workflow.
- The importance of prompt engineering:
  - Zero-shot vs. one-shot vs. few-shot prompting.
  - Writing clear, task-specific instructions.
- How in-context learning can dramatically improve model performance without any weight updates.
- Trade-offs between deterministic and creative outputs using temperature and sampling methods.
- Practical considerations like context length limits and the diminishing returns of adding more examples to a prompt.
This project is a strong foundation for future work such as:
- Fine-tuning FLAN-T5 (e.g., with LoRA/PEFT).
- Building an automated summarization service for real customer support logs.
