Chain of Thought Prompting

Here’s a puzzle. You ask an LLM a multi-step math problem and it gets it wrong. You ask the exact same question but add “think step by step” — and it gets it right. Why?

This is chain-of-thought (CoT) prompting, and understanding why it works helps you apply it correctly and know when to use it.

The Core Idea

Standard prompting asks the model to go directly from question to answer. Chain-of-thought prompting encourages the model to show its intermediate reasoning first, then arrive at a final answer.

Standard prompting:
Q: If a shirt costs $45 and is 30% off, then you apply a $5 coupon, what do you pay?
A: $26.50  ← may be right, but there is no work to check

Chain-of-thought prompting:
Q: If a shirt costs $45 and is 30% off, then you apply a $5 coupon, what do you pay?
   Let's think step by step.
A: Step 1: 30% off $45 = 0.30 × 45 = $13.50 discount
   Step 2: Discounted price = $45 - $13.50 = $31.50
   Step 3: Apply $5 coupon = $31.50 - $5.00 = $26.50
   Final answer: $26.50  ← same answer, now with steps you can verify

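On a problem this simple, both prompts will often produce the right answer. The gap shows up on longer reasoning chains, where a single slip in an unwritten intermediate step ruins the result and CoT can substantially improve accuracy.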
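To make the contrast concrete, here is a minimal sketch of the two prompt styles as plain string builders, along with an independent check of the arithmetic in the worked example. The function names `standard_prompt` and `cot_prompt` are illustrative, not part of any library.

```python
def standard_prompt(question: str) -> str:
    """Ask for the answer directly, with no reasoning requested."""
    return f"Q: {question}\nA:"


def cot_prompt(question: str) -> str:
    """Same question, plus a cue that asks for intermediate steps first."""
    return f"Q: {question}\n   Let's think step by step.\nA:"


question = (
    "If a shirt costs $45 and is 30% off, then you apply a $5 coupon, "
    "what do you pay?"
)

# Check the worked example's arithmetic without involving any model.
discount = 0.30 * 45          # Step 1: the 30% discount
discounted = 45 - discount    # Step 2: price after the discount
final_price = discounted - 5  # Step 3: price after the coupon

print(cot_prompt(question))
print(f"Expected final answer: ${final_price:.2f}")
```

Having the expected value computed separately is useful later, when you want to grade model outputs automatically.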

Why Thinking Out Loud Works

The reason CoT helps isn’t that the model is “trying harder.” It’s structural.

When you force the model to generate intermediate reasoning steps, you’re exploiting two properties of autoregressive generation:

Each generated token is visible to everything generated after it: When the model writes “Step 1: 30% of $45 = $13.50”, that intermediate result is now in the context for all subsequent generation. The model “sees” the correct intermediate value when computing Step 2.
Working memory through the context: Language models don’t have working memory beyond what’s in the context. CoT effectively creates working memory by externalizing the reasoning process into the output.

Without CoT, the model must solve a multi-step problem “in one shot” using only its internal representations. With CoT, each step builds on the explicitly written results of previous steps.
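The following toy loop is an analogy for this mechanism, not a real language model. The stand-in function `toy_next_line` (a hypothetical name) can only look at the text written so far, so each step has to read the previous result back out of the context, which is the role the written reasoning plays in CoT.

```python
import re


def last_amount(context: str) -> float:
    """Read the most recently written dollar amount back out of the text."""
    return float(re.findall(r"\$(\d+(?:\.\d+)?)", context)[-1])


def toy_next_line(context: str):
    """Stand-in for a model: chooses the next step from the text so far only."""
    if "Step 1" not in context:
        return f"Step 1: 30% of $45 = ${0.30 * 45:.2f}"
    if "Step 2" not in context:
        prev = last_amount(context)  # the discount written in Step 1
        return f"Step 2: $45 - ${prev:.2f} = ${45 - prev:.2f}"
    if "Step 3" not in context:
        prev = last_amount(context)  # the discounted price written in Step 2
        return f"Step 3: ${prev:.2f} - $5 = ${prev - 5:.2f}"
    return None  # nothing left to write


def generate(prompt: str, next_line, max_lines: int = 10) -> str:
    """Append one line at a time; every new line is computed from the full context."""
    context = prompt
    for _ in range(max_lines):
        line = next_line(context)
        if line is None:
            break
        context += "\n" + line
    return context


transcript = generate("Q: A $45 shirt is 30% off, then a $5 coupon applies.", toy_next_line)
print(transcript)
```

If you delete the Step 1 line from the context, Step 2 has nothing correct to read, which mirrors what happens when a model is asked to jump straight to the answer.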

Zero-Shot CoT: Just Say “Think Step by Step”

The simplest approach requires no examples. Just append a reasoning trigger phrase:

"Think step by step."
"Let's work through this carefully."
"Break this down systematically."
"Reason through each part before giving your final answer."

The phrase “Let’s think step by step” was identified as surprisingly effective in the 2022 paper “Large Language Models are Zero-Shot Reasoners” (Kojima et al.). It works almost like a magic incantation for activating reasoning behavior in large models.
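In code, zero-shot CoT is just string concatenation. A minimal sketch, assuming a hypothetical helper named `add_trigger` and the trigger phrases listed above:

```python
# Trigger phrases taken from the list above.
TRIGGERS = (
    "Think step by step.",
    "Let's work through this carefully.",
    "Break this down systematically.",
    "Reason through each part before giving your final answer.",
)


def add_trigger(question: str, trigger: str = "Let's think step by step.") -> str:
    """Append a reasoning trigger on its own line after the question."""
    return f"{question.strip()}\n\n{trigger}"


print(add_trigger("A train leaves at 3:40 pm and arrives at 6:15 pm. How long is the trip?"))
```

Which phrase works best varies by model, so it is worth trying a few against your own test questions.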

Works best for:

Math word problems
Logic puzzles
Code debugging
Multi-step planning
Scientific reasoning

Few-Shot CoT: Demonstrating Reasoning Patterns

For more reliable results, show examples of the full reasoning process:

Q: A train travels 150 miles in 2.5 hours. At the same speed,
   how long to travel 390 miles?
A: First, I need the speed: 150 miles / 2.5 hours = 60 mph.
   Then time for 390 miles: 390 / 60 = 6.5 hours.
   Answer: 6.5 hours

Q: Maria has 3 times as many stamps as Tom. Together they have
   120 stamps. How many does Maria have?
A: Let Tom's stamps = t. Then Maria's = 3t.
   Total: t + 3t = 4t = 120
   So t = 30, Maria has 3 × 30 = 90 stamps.
   Answer: 90 stamps

Q: [your actual problem here]
A:

The model will follow the same pattern of showing work before giving a final answer.
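A few-shot CoT prompt like the one above can be assembled from a list of worked examples. This is a minimal sketch; `build_few_shot_prompt` is an illustrative name, and the examples are the two from the text.

```python
# Each example pairs a question with a full worked answer, not just the result.
EXAMPLES = [
    (
        "A train travels 150 miles in 2.5 hours. At the same speed, "
        "how long to travel 390 miles?",
        "First, I need the speed: 150 miles / 2.5 hours = 60 mph.\n"
        "Then time for 390 miles: 390 / 60 = 6.5 hours.\n"
        "Answer: 6.5 hours",
    ),
    (
        "Maria has 3 times as many stamps as Tom. Together they have "
        "120 stamps. How many does Maria have?",
        "Let Tom's stamps = t. Then Maria's = 3t.\n"
        "Total: t + 3t = 4t = 120\n"
        "So t = 30, Maria has 3 × 30 = 90 stamps.\n"
        "Answer: 90 stamps",
    ),
]


def build_few_shot_prompt(examples, question: str) -> str:
    """Join the worked examples, then leave the final answer slot open."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


print(build_few_shot_prompt(EXAMPLES, "A car uses 8 liters per 100 km. How much for 350 km?"))
```

Ending every example with the same "Answer:" line also makes the final answer easy to extract from the model's output.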

When CoT Helps (and When It Doesn’t)

CoT helps significantly:

Arithmetic and math word problems
Multi-step logical deductions
Code with bugs that require tracing execution
Legal/policy reasoning that requires applying rules to facts
Science problems requiring formula application

CoT has modest impact:

Simple factual recall (“What is the capital of Germany?”)
Straightforward classification (sentiment of “I love this!”)
Direct translation

CoT can hurt:

Simple tasks where reasoning adds noise
Very small models (roughly under 7B params), which often lack the capacity to generate useful reasoning
When the model’s “reasoning” is post-hoc rationalization of a wrong answer (self-consistency sampling helps here)

Self-Consistency: Multiple Reasoning Paths

One of the most powerful extensions of CoT is self-consistency sampling. Instead of taking the first answer, you generate multiple reasoning chains (with temperature > 0) and take the majority vote.

Generate 5 solutions to the same problem at temperature 0.7:
  Chain 1 → $26.50 ✓
  Chain 2 → $26.50 ✓
  Chain 3 → $31.50 ✗ (forgot the coupon)
  Chain 4 → $26.50 ✓
  Chain 5 → $26.50 ✓

Majority vote: $26.50 → much more reliable than single-sample

Self-consistency adds cost (roughly 5× the tokens when you sample five chains) but can substantially improve accuracy on hard reasoning tasks. It is used in production by systems that need high reliability on math or logic.
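The voting step itself is simple. Here is a minimal sketch that assumes each chain ends with a line such as "Final answer: $26.50". The sampled chains are hard-coded stand-ins for real model outputs, and `extract_answer` and `majority_vote` are illustrative names.

```python
import re
from collections import Counter


def extract_answer(chain: str):
    """Pull the number that follows 'Final answer:' or 'Answer:', if present."""
    match = re.search(
        r"(?:Final answer|Answer):\s*\$?([\d,]+(?:\.\d+)?)", chain, re.IGNORECASE
    )
    return match.group(1).replace(",", "") if match else None


def majority_vote(chains):
    """Return the most common extracted answer and its share of the valid votes."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# Hard-coded stand-ins for five chains sampled at temperature 0.7.
chains = [
    "Step 1: ... Final answer: $26.50",
    "Step 1: ... Final answer: $26.50",
    "Step 1: ... Final answer: $31.50",  # this chain forgot the coupon
    "Step 1: ... Final answer: $26.50",
    "Step 1: ... Final answer: $26.50",
]

answer, agreement = majority_vote(chains)
print(answer, agreement)
```

The agreement ratio is a useful side product: low agreement across chains is a cheap signal that the problem deserves a second look.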

Tree of Thought (ToT)

Tree of Thought extends CoT by building a search tree of reasoning paths rather than a single linear chain. The model proposes multiple reasoning steps at each decision point, evaluates them, and expands the most promising ones.

Problem
   │
   ├── Approach A → [evaluate: promising]
   │       ├── Step A.1 → [evaluate: dead end] ✗
   │       └── Step A.2 → [evaluate: promising] ✓
   │               └── Final answer (from branch A.2)
   │
   └── Approach B → [evaluate: weaker]

ToT is computationally expensive (many model calls per problem), but it can substantially outperform a single linear chain on problems that require strategic search, such as the Game of 24 or constrained creative writing tasks.
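The propose, evaluate, and expand loop can be sketched as a small beam search. In a real ToT system the `propose` and `evaluate` functions would be model calls; here they are plain Python functions on a toy arithmetic puzzle (reach 10 from 1 using "+3" and "*2"), and all names are illustrative assumptions.

```python
def tree_search(root, propose, evaluate, is_goal, beam_width=2, max_depth=4):
    """Level-by-level search that keeps only the best `beam_width` states."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand: every surviving state proposes its possible next steps.
        candidates = [child for state in frontier for child in propose(state)]
        for child in candidates:
            if is_goal(child):
                return child
        # Evaluate and prune: keep only the most promising branches.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            break
    return None


TARGET = 10


def propose(state):
    """Toy stand-in for 'the model suggests next steps': two arithmetic moves."""
    value, path = state
    return [(value + 3, path + ("+3",)), (value * 2, path + ("*2",))]


def evaluate(state):
    """Toy stand-in for 'the model rates a partial solution': closer is better."""
    return -abs(TARGET - state[0])


def is_goal(state):
    return state[0] == TARGET


# A state is (current value, operations applied so far).
solution = tree_search((1, ()), propose, evaluate, is_goal)
print(solution)
```

The cost profile is visible even in the toy version: every level calls `propose` once per surviving state and `evaluate` once per candidate, which is where the many model calls come from.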

Reasoning Models: CoT Built In

A significant shift in 2024–2025: the emergence of reasoning models that do chain-of-thought internally, before generating their response.

OpenAI o1/o3: Generates a long internal “thinking” trace (not shown to users by default) before answering. Dramatically better at math, science, and multi-step logic.
Claude 3.7 Sonnet (Extended Thinking): Similar internal reasoning capability, optionally surfaceable to users.
Gemini 2.0 Flash Thinking: Built-in reasoning with visible thought process.
DeepSeek-R1: Open-source reasoning model trained with reinforcement learning (GRPO, Group Relative Policy Optimization) to learn reasoning behavior.

For these models, you generally don’t need to add “think step by step” because they reason internally by default (or, where the feature is optional, once it is enabled). For earlier-generation models (GPT-4, Claude 3 Sonnet), CoT prompting remains highly effective.

Practical Implementation

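# Using the Anthropic Python SDK (pip install anthropic).
# The same prompt text works with other chat APIs.
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

prompt = """
Solve this problem step by step. Show each calculation.
At the end, state your final answer clearly.

Problem: A company's revenue grew 15% in Q1, declined 8% in Q2,
and grew 20% in Q3. Starting from $1,000,000 in revenue,
what is the Q3 ending revenue?
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}]
)

# The reply text, including the step-by-step reasoning.
print(response.content[0].text)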

The model’s reasoning process is now part of its output, making it easier to:

Verify that the model reached the answer correctly (not by luck)
Identify exactly where a wrong answer went astray
Build user-facing explanations from the model’s own reasoning
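Because the reasoning is written out, it can also be checked programmatically. The sketch below computes the expected result for the revenue problem and compares it with the last dollar amount in a reply. The `sample_reply` string is a hand-written stand-in, not real model output, and the helper names are illustrative.

```python
import re


def expected_q3_revenue(start=1_000_000, changes=(0.15, -0.08, 0.20)):
    """Apply each quarterly change in order: +15%, then -8%, then +20%."""
    revenue = start
    for change in changes:
        revenue *= 1 + change
    return round(revenue, 2)


def extract_final_amount(text: str):
    """Return the last dollar amount in the text as a float, or None."""
    amounts = re.findall(r"\$([\d,]+(?:\.\d+)?)", text)
    return float(amounts[-1].replace(",", "")) if amounts else None


# Hand-written stand-in for a model reply (an assumption for illustration).
sample_reply = (
    "Q1: $1,000,000 × 1.15 = $1,150,000\n"
    "Q2: $1,150,000 × 0.92 = $1,058,000\n"
    "Q3: $1,058,000 × 1.20 = $1,269,600\n"
    "Final answer: $1,269,600"
)

model_answer = extract_final_amount(sample_reply)
is_correct = (
    model_answer is not None
    and abs(model_answer - expected_q3_revenue()) < 0.01
)
print(model_answer, is_correct)
```

The same extraction function can be run on each intermediate line to find the first step where a wrong answer diverges from the expected value.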