Chain of Thought Prompting
Here’s a puzzle. You ask an LLM a multi-step math problem and it gets it wrong. You ask the exact same question but add “think step by step” — and it gets it right. Why?
This is chain-of-thought (CoT) prompting, and understanding why it works helps you apply it correctly and know when to use it.
The Core Idea
Standard prompting asks the model to go directly from question to answer. Chain-of-thought prompting encourages the model to show its intermediate reasoning first, then arrive at a final answer.
Standard prompting:

    Q: If a shirt costs $45 and is 30% off, then you apply a $5 coupon, what do you pay?
    A: $26.50 ← might be wrong

Chain-of-thought prompting:

    Q: If a shirt costs $45 and is 30% off, then you apply a $5 coupon, what do you pay? Let's think step by step.
    A: Step 1: 30% off $45 = 0.30 × 45 = $13.50 discount
       Step 2: Discounted price = $45 - $13.50 = $31.50
       Step 3: Apply $5 coupon = $31.50 - $5.00 = $26.50
       Final answer: $26.50 ← same answer, but reliably correct

For this simple example, both might work. But for genuinely complex reasoning chains, CoT dramatically improves accuracy.
Why Thinking Out Loud Works
The reason CoT helps isn’t that the model is “trying harder.” It’s structural.
When you force the model to generate intermediate reasoning steps, you’re exploiting two properties of autoregressive generation:
- Each generated token is visible to subsequent generation: When the model writes “Step 1: 30% off $45 = $13.50”, that intermediate result is now in the context for all subsequent generation. The model “sees” the correct intermediate value when computing Step 2.
- Working memory through the context: Language models don’t have working memory beyond what’s in the context. CoT effectively creates working memory by externalizing the reasoning process into the output.
Without CoT, the model must solve a multi-step problem “in one shot” using only its internal representations. With CoT, each step builds on the explicitly written results of previous steps.
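The "context as working memory" idea can be illustrated with a toy analogy in plain Python. This is only a sketch: no language model is involved, and `run_chain` and the step names are hypothetical. Each step can read only what has already been written into `context`, which mirrors how a later reasoning step can rely on an earlier result once it is part of the generated text.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, float]], float]]


def run_chain(facts: Dict[str, float], steps: List[Step]):
    """Run steps in order; each step reads only values already written to context."""
    context = dict(facts)  # plays the role of the prompt: facts stated in the question
    transcript = []
    for name, compute in steps:
        context[name] = compute(context)  # the result is now visible to later steps
        transcript.append(f"{name} = {context[name]:.2f}")
    return context, transcript


# The shirt example: each step depends on the written result of the previous one.
steps = [
    ("discount", lambda c: c["rate"] * c["price"]),
    ("discounted_price", lambda c: c["price"] - c["discount"]),
    ("final_price", lambda c: c["discounted_price"] - c["coupon"]),
]

context, transcript = run_chain({"price": 45.0, "rate": 0.30, "coupon": 5.0}, steps)
print("\n".join(transcript))
```

Removing the intermediate entries from `context` would force the last step to recompute everything at once, which is the analogue of answering "in one shot".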
Zero-Shot CoT: Just Say “Think Step by Step”
The simplest approach requires no examples. Just append a reasoning trigger phrase:

    "Think step by step."
    "Let's work through this carefully."
    "Break this down systematically."
    "Reason through each part before giving your final answer."

The phrase “Let’s think step by step” was identified in a 2022 paper (Kojima et al., “Large Language Models are Zero-Shot Reasoners”) as surprisingly effective, almost a magic incantation for activating reasoning behavior in large models.
Works best for:
- Math word problems
- Logic puzzles
- Code debugging
- Multi-step planning
- Scientific reasoning
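In code, zero-shot CoT amounts to string concatenation. The helper below is a minimal sketch; the function name `make_zero_shot_cot_prompt` and the default trigger constant are illustrative choices, not part of any library.

```python
DEFAULT_TRIGGER = "Let's think step by step."


def make_zero_shot_cot_prompt(question: str, trigger: str = DEFAULT_TRIGGER) -> str:
    """Append a reasoning trigger phrase to the end of a question."""
    return f"{question.strip()} {trigger}"


prompt = make_zero_shot_cot_prompt(
    "If a shirt costs $45 and is 30% off, then you apply a $5 coupon, what do you pay?"
)
print(prompt)
```

Keeping the trigger as a parameter makes it easy to compare phrasings on your own task, since the most effective wording may vary by model.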
Few-Shot CoT: Demonstrating Reasoning Patterns
For more reliable results, show examples of the full reasoning process:
    Q: A train travels 150 miles in 2.5 hours. At the same speed, how long to travel 390 miles?
    A: First, I need the speed: 150 miles / 2.5 hours = 60 mph.
       Then time for 390 miles: 390 / 60 = 6.5 hours.
       Answer: 6.5 hours

    Q: Maria has 3 times as many stamps as Tom. Together they have 120 stamps. How many does Maria have?
    A: Let Tom's stamps = t. Then Maria's = 3t.
       Total: t + 3t = 4t = 120
       So t = 30, Maria has 3 × 30 = 90 stamps.
       Answer: 90 stamps

    Q: [your actual problem here]
    A:

The model will follow the same pattern of showing work before giving a final answer.
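A few-shot CoT prompt like the one above can be assembled programmatically. The sketch below assumes worked examples are stored as (question, reasoning) pairs; `build_few_shot_cot_prompt` is a hypothetical helper name, and the final question about the car is made up for illustration.

```python
from typing import List, Tuple


def build_few_shot_cot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join worked Q/A examples, then leave the final answer open for the model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    (
        "A train travels 150 miles in 2.5 hours. At the same speed, how long to travel 390 miles?",
        "First, I need the speed: 150 miles / 2.5 hours = 60 mph. "
        "Then time for 390 miles: 390 / 60 = 6.5 hours. Answer: 6.5 hours",
    ),
    (
        "Maria has 3 times as many stamps as Tom. Together they have 120 stamps. "
        "How many does Maria have?",
        "Let Tom's stamps = t. Then Maria's = 3t. Total: t + 3t = 4t = 120. "
        "So t = 30, Maria has 3 × 30 = 90 stamps. Answer: 90 stamps",
    ),
]

# Hypothetical target question, used only to show the final prompt shape.
prompt = build_few_shot_cot_prompt(
    examples, "A car uses 4 gallons to drive 120 miles. How many gallons for 300 miles?"
)
print(prompt)
```

Ending the prompt on a bare "A:" invites the model to continue in the same show-your-work format as the examples.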
When CoT Helps (and When It Doesn’t)
CoT helps significantly:
- Arithmetic and math word problems
- Multi-step logical deductions
- Code with bugs that require tracing execution
- Legal/policy reasoning that requires applying rules to facts
- Science problems requiring formula application
CoT has modest impact:
- Simple factual recall (“What is the capital of Germany?”)
- Straightforward classification (sentiment of “I love this!”)
- Direct translation
CoT can hurt:
- Simple tasks where reasoning adds noise
- Very small models (roughly < 7B params), which often lack the capacity to generate useful reasoning
- When the model’s “reasoning” is post-hoc rationalization of a wrong answer (self-consistency sampling helps here)
Self-Consistency: Multiple Reasoning Paths
One of the most powerful extensions of CoT is self-consistency sampling. Instead of taking the first answer, you generate multiple reasoning chains (with temperature > 0) and take the majority vote.
Generate 5 solutions to the same problem at temperature 0.7:

    Chain 1 → $26.50 ✓
    Chain 2 → $26.50 ✓
    Chain 3 → $31.50 ✗ (forgot the coupon)
    Chain 4 → $26.50 ✓
    Chain 5 → $26.50 ✓

    Majority vote: $26.50 → much more reliable than single-sample

Self-consistency adds cost (5× more tokens for 5 samples) but can dramatically improve accuracy on hard reasoning tasks. It suits production systems that need high reliability on math or logic.
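The voting step can be sketched in a few lines. In this example `sample_chain` stands in for a model call at temperature > 0; here it is replaced by canned strings so the code runs without an API. The function names and the "Final answer:" convention are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample_chain: Callable[[], str],
    extract_answer: Callable[[str], str],
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample several reasoning chains and return (majority answer, agreement rate)."""
    answers = [extract_answer(sample_chain()) for _ in range(n_samples)]
    # On a tie, Counter keeps the answer that was seen first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def final_answer(chain: str) -> str:
    """Take whatever follows the last 'Final answer:' marker in a chain."""
    return chain.rsplit("Final answer:", 1)[-1].strip()


# Canned chains standing in for five sampled model outputs.
canned_chains = iter([
    "Discount is $13.50, price $31.50, minus $5 coupon. Final answer: $26.50",
    "45 × 0.70 = 31.50, then 31.50 - 5 = 26.50. Final answer: $26.50",
    "30% off $45 leaves $31.50. Final answer: $31.50",  # forgot the coupon
    "$45 - $13.50 - $5.00. Final answer: $26.50",
    "Price after discount and coupon. Final answer: $26.50",
])

answer, agreement = self_consistent_answer(lambda: next(canned_chains), final_answer)
print(answer, agreement)
```

The agreement rate is a useful by-product: low agreement can be treated as a signal that the question deserves more samples or human review.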
Tree of Thought (ToT)
Tree of Thought extends CoT by building a search tree of reasoning paths rather than a single linear chain. The model proposes multiple reasoning steps at each decision point, evaluates them, and expands the most promising ones.

    Problem
    │
    ├── Approach A → [evaluate: promising]
    │   ├── Step A.1 → [evaluate: dead end] ✗
    │   └── Step A.2 → [evaluate: promising] ✓
    │       └── Final answer (from branch A.2)
    │
    └── Approach B → [evaluate: weaker]

ToT is computationally expensive (many model calls per problem) but can substantially outperform linear CoT on problems that require strategic search, such as the Game of 24 or constrained creative writing tasks.
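The propose, evaluate, and expand loop can be sketched as a breadth-first beam search, which is one of several possible search strategies. In a real ToT setup, `propose` and `evaluate` would be model calls; here they are plain functions over a toy number puzzle so the example runs on its own. All names are hypothetical.

```python
from typing import Callable, List, Optional


def tree_of_thought_search(
    root: List[int],
    propose: Callable[[List[int]], List[List[int]]],
    evaluate: Callable[[List[int]], float],
    is_goal: Callable[[List[int]], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[List[int]]:
    """Expand the best `beam_width` partial paths per level until a goal is found."""
    frontier = [root]
    for _ in range(max_depth):
        # Propose next steps for every path currently kept in the beam.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Keep only the most promising paths; the rest are pruned.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy puzzle: start at 1 and reach 10 using "+3" or "×2" moves.
TARGET = 10
result = tree_of_thought_search(
    root=[1],
    propose=lambda path: [path + [path[-1] + 3], path + [path[-1] * 2]],
    evaluate=lambda path: -abs(TARGET - path[-1]),  # closer to the target scores higher
    is_goal=lambda path: path[-1] == TARGET,
)
print(result)
```

The `beam_width` and `max_depth` parameters bound the number of evaluations, which is where the cost trade-off mentioned above shows up in practice.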
Reasoning Models: CoT Built In
A significant shift in 2024–2025: the emergence of reasoning models that do chain-of-thought internally, before generating their response.
- OpenAI o1/o3: Generates a long internal “thinking” trace (not shown to users by default) before answering. Dramatically better at math, science, and multi-step logic.
- Claude 3.7 Sonnet (Extended Thinking): Similar internal reasoning capability, optionally surfaceable to users.
- Gemini 2.0 Flash Thinking: Built-in reasoning with visible thought process.
- DeepSeek-R1: Open-source reasoning model trained with reinforcement learning (GRPO, Group Relative Policy Optimization) to learn reasoning behavior.
For these models, you don’t need to add “think step by step” because they reason internally by default. For earlier-generation models (GPT-4, Claude 3 Sonnet), CoT prompting remains highly effective.
Practical Implementation
    # Using the Anthropic Python SDK (the OpenAI SDK follows a similar pattern)
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    prompt = """Solve this problem step by step. Show each calculation.
    At the end, state your final answer clearly.

    Problem: A company's revenue grew 15% in Q1, declined 8% in Q2,
    and grew 20% in Q3. Starting from $1,000,000 in revenue,
    what is the Q3 ending revenue?"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.content[0].text)  # reasoning steps followed by the final answer

The model’s reasoning process is now part of its output, making it easier to:
- Verify that the model reached the answer correctly (not by luck)
- Identify exactly where a wrong answer went astray
- Build user-facing explanations from the model’s own reasoning
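Because the final answer is stated explicitly, it can also be checked automatically. The sketch below assumes the response ends with a dollar amount and compares it against an independently computed value for the revenue problem. The `sample_output` string is a hypothetical model response, not real output, and the helper names are illustrative.

```python
import re
from typing import Optional


def extract_final_dollar_amount(text: str) -> Optional[float]:
    """Return the last dollar amount mentioned in the text, or None if there is none."""
    matches = re.findall(r"\$\s*([0-9][0-9,]*(?:\.[0-9]+)?)", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def expected_q3_revenue(start: float = 1_000_000.0) -> float:
    """Apply +15%, -8%, +20% in sequence to the starting revenue."""
    return start * 1.15 * 0.92 * 1.20


# Hypothetical model response, written by hand for this example.
sample_output = (
    "Step 1: Q1 revenue = $1,000,000 × 1.15 = $1,150,000\n"
    "Step 2: Q2 revenue = $1,150,000 × 0.92 = $1,058,000\n"
    "Step 3: Q3 revenue = $1,058,000 × 1.20 = $1,269,600\n"
    "Final answer: $1,269,600"
)

model_answer = extract_final_dollar_amount(sample_output)
is_correct = model_answer is not None and abs(model_answer - expected_q3_revenue()) < 0.01
print(model_answer, is_correct)
```

When the check fails, the intermediate steps in the response show which line of the calculation to inspect first.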