Few-Shot Prompting

Few-shot prompting is the technique of including a small number of examples — typically 2 to 10 — directly in your prompt to show the model the exact input/output pattern you expect. It’s one of the most reliable ways to improve LLM output quality for specialized tasks.

The “few” in few-shot refers to the number of demonstrations: one example is one-shot, two or more is few-shot, and no examples is zero-shot.

Why Examples Work Better Than Instructions

You can often describe a task clearly in words and still get inconsistent outputs. Show the model three examples of what you want, and the outputs usually become far more consistent. Why?

Language models are, at a deep level, pattern completers. When you provide examples in the prompt, you’re establishing a strong, concrete pattern. The model’s next move is to continue that pattern with your actual input.

Without examples:
"Format the address for USPS."
→ Might capitalize differently each time, may or may not add USPS-specific formatting

With examples:
Address: 123 main st, new york, ny 10001
Formatted: 123 MAIN ST
           NEW YORK NY 10001

Address: 456 oak avenue apt 2b, chicago, illinois 60601
Formatted: 456 OAK AVE APT 2B
           CHICAGO IL 60601

Address: [your input]
Formatted:

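The second version is far more likely to produce consistent USPS-style formatting, because the examples pin down the capitalization, abbreviations, and line breaks that the bare instruction leaves open.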

Anatomy of a Few-Shot Prompt

[Optional: Task description or system context]

[Example 1]
Input: [example input]
Output: [example output]

[Example 2]
Input: [example input]
Output: [example output]

[Example N]
Input: [example input]
Output: [example output]

Input: [actual input you want answered]
Output:

The trailing Output: is important — it signals to the model that it should generate the output, not provide commentary about the task.
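The template above can be assembled in code. The following is a minimal sketch in Python, not a standard API: the function name `build_few_shot_prompt`, its parameters, and the third address used as the query are all illustrative assumptions.

```python
def build_few_shot_prompt(examples, query, task_description=""):
    """Assemble a few-shot prompt from (input, output) pairs.

    `examples` is a list of (input_text, output_text) tuples.
    The prompt ends with a bare "Output:" so the model continues from there.
    """
    parts = []
    if task_description:
        parts.append(task_description.strip())
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The actual query uses the same labels, with the output left open.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


examples = [
    ("123 main st, new york, ny 10001", "123 MAIN ST\nNEW YORK NY 10001"),
    ("456 oak avenue apt 2b, chicago, illinois 60601", "456 OAK AVE APT 2B\nCHICAGO IL 60601"),
]
prompt = build_few_shot_prompt(
    examples,
    query="789 pine road, austin, texas 73301",  # made-up input for illustration
    task_description="Format each address for USPS.",
)
print(prompt)
```

Keeping the labels in one function means every example and the final query share exactly the same structure, which is the property the template depends on.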

Choosing Good Examples

The quality of your examples matters more than the quantity. Poor examples teach the model the wrong pattern.

Criteria for Good Few-Shot Examples

Representative: Cover the main cases the model will encounter, not just easy ones.

Diverse: Don’t use five examples that are all nearly identical. Cover different lengths, edge cases, and scenarios.

High quality: Every example should be perfect. One bad example can corrupt the pattern.

Correctly formatted: Exactly the output format you want in production. If you want JSON, every example should output valid JSON.

Balanced (for classification): If you’re classifying into 3 categories, try to include at least one example of each.
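Some of these criteria can be checked mechanically before the examples ever reach a prompt. The sketch below is one possible approach, with a hypothetical helper named `check_examples`; it covers only duplicates, JSON validity, and label coverage. Whether examples are representative and high quality still needs human review.

```python
import json
from collections import Counter


def check_examples(examples, expected_labels=None, require_json=False):
    """Return a list of human-readable problems found in few-shot examples.

    `examples` is a list of (input_text, output_text) pairs.
    """
    problems = []

    # Diverse: exact duplicate inputs add tokens without adding coverage.
    seen_inputs = Counter(text.strip().lower() for text, _ in examples)
    for text, count in seen_inputs.items():
        if count > 1:
            problems.append(f"duplicate input ({count}x): {text!r}")

    # Correctly formatted: every output must parse if JSON is the target format.
    if require_json:
        for text, output in examples:
            try:
                json.loads(output)
            except json.JSONDecodeError:
                problems.append(f"output is not valid JSON for input: {text!r}")

    # Balanced: each class should appear at least once.
    if expected_labels is not None:
        used = {output.strip() for _, output in examples}
        for label in expected_labels:
            if label not in used:
                problems.append(f"no example for label: {label!r}")

    return problems


# Made-up sentiment examples for illustration.
sentiment_examples = [
    ("Loved it, would buy again.", "positive"),
    ("Arrived broken and support never replied.", "negative"),
    ("Loved it, would buy again.", "positive"),
]
issues = check_examples(
    sentiment_examples, expected_labels=["positive", "negative", "neutral"]
)
for issue in issues:
    print(issue)
```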

Practical Example: Entity Extraction

Suppose you want to extract key entities from support tickets:

Extract the product name, issue type, and severity from each ticket.
Return as JSON with keys: product, issue_type, severity (critical/high/medium/low).

---
Ticket: "The dashboard crashes immediately when I try to export reports to PDF.
         This is blocking our weekly finance review."
{
  "product": "dashboard",
  "issue_type": "crash",
  "severity": "critical"
}

---
Ticket: "The dark mode toggle doesn't save between sessions in the mobile app.
         Minor annoyance but consistent."
{
  "product": "mobile app",
  "issue_type": "settings persistence",
  "severity": "low"
}

---
Ticket: "Users are getting logged out mid-session on the web portal,
         causing data loss on long forms."
{
  "product": "web portal",
  "issue_type": "session timeout",
  "severity": "high"
}

---
Ticket: "The Slack integration stopped posting notifications 2 days ago.
         Our whole team relies on this for incident alerts."

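Because the prompt ends right after the fourth ticket, the model’s most likely continuation is a JSON object with the same three keys and the same severity scale as the examples.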
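Even with good examples, it is worth validating the reply before passing it downstream. The sketch below assumes the reply arrives as a plain string. The function name `parse_ticket_extraction` is hypothetical, and the sample reply is written by hand for illustration rather than taken from a real model.

```python
import json

REQUIRED_KEYS = {"product", "issue_type", "severity"}
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}


def parse_ticket_extraction(reply_text):
    """Parse and validate a reply that should match the JSON shape in the examples."""
    data = json.loads(reply_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {data['severity']!r}")
    return data


# Hand-written stand-in for a model reply (not real model output).
sample_reply = """{
  "product": "Slack integration",
  "issue_type": "notifications not sent",
  "severity": "high"
}"""
ticket = parse_ticket_extraction(sample_reply)
print(ticket["product"], ticket["severity"])
```

Rejecting replies with missing keys or an unknown severity turns silent format drift into an explicit error that you can retry or log.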

How Many Examples?

More isn’t always better. Token cost scales with example count, and very long example sets can dilute the model’s attention on the actual task.

General guidelines:
1 example (one-shot):   Good for format demonstrations
3–5 examples:           Sweet spot for most tasks
6–10 examples:          When you need to cover many edge cases
10+ examples:           Usually better to fine-tune instead

In practice, additional examples tend to give diminishing returns past a certain point, often around 8–10 examples, though the exact number varies by task and model. If you find you need well beyond 10 examples to get consistent performance, consider fine-tuning.
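To see how token cost scales with example count, a back-of-the-envelope estimate is enough. The sketch below assumes roughly 4 characters per token, which is only a rough rule of thumb for English text; for real numbers, use your model’s tokenizer. The function names are illustrative.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; assumes about 4 characters per token."""
    return max(1, round(len(text) / chars_per_token))


def prompt_overhead(examples, chars_per_token=4):
    """Estimated tokens that the examples alone add to every request."""
    return sum(estimate_tokens(inp + out, chars_per_token) for inp, out in examples)


example = ("123 main st, new york, ny 10001", "123 MAIN ST\nNEW YORK NY 10001")
for count in (1, 3, 5, 10):
    per_request = prompt_overhead([example] * count)
    print(
        f"{count:>2} examples: ~{per_request} tokens per request, "
        f"~{per_request * 1000:,} per 1,000 requests"
    )
```

The overhead grows linearly with the number of examples and is paid on every request, which is why trimming redundant examples matters at scale.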

Example Selection Strategies

Static Selection

You write the same examples for every query. Simple, predictable, works well when your inputs are homogeneous.

Dynamic Selection (Advanced)

Select examples that are most similar to the current input, retrieved from an example bank using semantic search.

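# Pseudocode: dynamic few-shot selection
# (vector_db and build_prompt are placeholders for your own retrieval and templating code)
user_query = "The API is returning 500 errors on POST requests"
similar_examples = vector_db.search(user_query, top_k=3)  # nearest examples in the bank
prompt = build_prompt(examples=similar_examples, query=user_query)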

This is especially powerful for classification tasks with many categories, or when your input distribution is wide. The model sees examples closest to the current problem, so the pattern it learns is most relevant.
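For a version you can run without a vector database, the sketch below swaps semantic search for bag-of-words cosine similarity using only the standard library. This is a crude stand-in: a real system would typically use embeddings. The example bank, its labels, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, example_bank, top_k=3):
    """Return the top_k (input, output) pairs most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        example_bank,
        key=lambda pair: cosine_similarity(query_vec, bag_of_words(pair[0])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical example bank of (ticket text, label) pairs.
example_bank = [
    ("The API returns 500 errors on every POST request", "backend error"),
    ("Dark mode toggle does not save between sessions", "settings persistence"),
    ("GET requests to the API time out after 30 seconds", "backend latency"),
    ("Invoice PDF shows the wrong billing address", "billing"),
]
user_query = "The API is returning 500 errors on POST requests"
for text, label in select_examples(user_query, example_bank, top_k=2):
    print(f"{label}: {text}")
```

Word overlap misses paraphrases ("crashes" versus "stops working"), which is the gap that embedding-based search is meant to close.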

Common Few-Shot Mistakes

Inconsistent Formatting Between Examples

Bad example:
Input: "fix the bug"  ← lowercase, imperative
Output: "The solution is..."  ← full sentence

Next example:
Input: Fix the Bug  ← title case
Output: Fixed.  ← terse

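The model sees two conflicting patterns and has no way to tell which one you want, so its output style becomes unpredictable. Pick one input style and one output style, and use them in every example.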
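Inconsistencies like this can be flagged automatically. The sketch below compares a few surface features (quoting, capitalization of the first letter, final punctuation) against the first example. The feature set and the function names are assumptions chosen for illustration, not an exhaustive style check.

```python
def style_signature(text):
    """Summarize surface style: (is quoted, starts uppercase, ends with . ! or ?)."""
    stripped = text.strip()
    quoted = len(stripped) >= 2 and stripped[0] == stripped[-1] == '"'
    core = stripped.strip('"')
    return (
        quoted,
        core[:1].isupper(),
        core.endswith((".", "!", "?")),
    )


def find_style_mismatches(examples):
    """Return indexes of examples whose style differs from the first example."""
    if not examples:
        return []
    ref_in = style_signature(examples[0][0])
    ref_out = style_signature(examples[0][1])
    return [
        index
        for index, (inp, out) in enumerate(examples)
        if style_signature(inp) != ref_in or style_signature(out) != ref_out
    ]


# The two conflicting examples from above, as (input, output) pairs.
mixed_examples = [
    ('"fix the bug"', '"The solution is..."'),
    ("Fix the Bug", "Fixed."),
]
print(find_style_mismatches(mixed_examples))  # [1]
```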

Leaking the Answer in the Prompt Structure

Sentiment: This movie was amazing!
→ [before showing the answer]
Answer: Positive

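If the prompt’s structure or formatting around your actual input implies the answer before the model produces it, you’re accidentally biasing the results. Format the real input exactly like the example inputs, and keep anything that depends on the answer out of the text that precedes the answer slot.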

Using Examples That Are Too Similar to Each Other

If all five examples are short positive reviews, the model won’t learn how to handle long reviews or negative ones. Diversity in examples improves robustness.

Few-Shot vs. Fine-Tuning

A common question: when should I fine-tune instead of using few-shot examples?

Dimension	Few-Shot	Fine-Tuning
Setup time	Minutes	Days/weeks
Cost	Token cost per request	Compute cost upfront
Flexibility	Change examples easily	Fixed until retrained
Performance	Good for most tasks	Better for specialized tasks
Context overhead	Uses tokens	No extra tokens at inference
Privacy	Examples are sent with every API request	Training data stays local only if you fine-tune and host the model yourself

The practical answer: start with few-shot. If you need consistent performance at scale with low latency, and you have enough high-quality examples, fine-tuning becomes worth the investment.
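The cost row in the table can be turned into a rough break-even estimate. The sketch below is a simplification under stated assumptions: all three parameters are hypothetical placeholders, the fine-tuned model is assumed to need no examples in the prompt, and any difference in per-token price between the base and fine-tuned model is ignored.

```python
import math


def break_even_requests(example_tokens, price_per_1k_tokens, upfront_cost):
    """Number of requests at which the cumulative cost of sending the
    examples reaches the upfront fine-tuning cost.

    All inputs are placeholders; substitute your own measurements and pricing.
    """
    extra_cost_per_request = example_tokens / 1000 * price_per_1k_tokens
    if extra_cost_per_request <= 0:
        raise ValueError("examples must add a positive cost per request")
    return math.ceil(upfront_cost / extra_cost_per_request)


# Placeholder numbers chosen for easy arithmetic, not real prices.
requests_needed = break_even_requests(
    example_tokens=500, price_per_1k_tokens=0.5, upfront_cost=100
)
print(requests_needed)  # 400
```

Below the break-even volume, few-shot prompting is the cheaper option on this simplified model; above it, the upfront investment starts to pay off.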

In-Context Learning: Why It Works

The ability to learn from in-context examples without any weight updates is called in-context learning (ICL), and it’s one of the more surprising capabilities of large language models.

One leading hypothesis is that ICL works by activating task-specific circuits that formed during pre-training, although the mechanism is still an open research question. On this view, the examples don’t “teach” the model in the traditional sense; they help it identify which of its existing knowledge applies to this specific task format.

This would also explain why example quality matters so much: bad examples don’t teach the model bad habits (the weights don’t change), but they do confuse the task framing, which can lead the model to apply the wrong internal behavior.