Extended Thinking & Complex Reasoning
Using thinking blocks and budget tokens for complex tasks.
Learning Objectives
- Enable extended thinking for complex reasoning
- Configure budget tokens appropriately
- Know when extended thinking improves results
Extended Thinking and Complex Reasoning
Extended Thinking is an Anthropic-specific feature that gives Claude a dedicated, separate space to reason through complex problems before producing its visible response. Unlike chain-of-thought prompting (where reasoning appears in the output), Extended Thinking uses a special thinking block with its own token budget. This feature is particularly important for the CCA-F exam because it represents a distinct architectural decision with specific trade-offs.
How Extended Thinking Works
When you enable Extended Thinking, Claude's response includes two types of content blocks:
- Thinking blocks: Internal reasoning that Claude uses to work through the problem. These are visible to your application but are not shown to end users by default. Think of them as “scratch paper.”
- Text blocks: The final, polished response that incorporates the insights from the thinking process.
Enabling Extended Thinking
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Maximum tokens for thinking
    },
    messages=[{"role": "user", "content": (
        "Analyze the following business scenario and recommend a strategy.\n\n"
        "A mid-size SaaS company (ARR $50M) is seeing:\n"
        "- Customer churn increasing from 5% to 8% quarterly\n"
        "- Net revenue retention dropping to 95%\n"
        "- Sales cycle lengthening from 30 to 45 days\n"
        "- Support ticket volume up 40% YoY\n\n"
        "Diagnose the likely root causes and recommend a prioritized "
        "action plan with expected impact for each action."
    )}],
)

# Process the response
for block in response.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking[:200], "...")
    elif block.type == "text":
        print("RESPONSE:", block.text)
```

Budget Tokens
The budget_tokens parameter controls how much reasoning space Claude has. This is a critical architectural decision:
- Minimum: 1,024 tokens. Below this, Extended Thinking cannot be enabled.
- Practical range: 5,000 to 30,000 tokens for most tasks.
- Maximum: must be less than max_tokens. The total response (thinking + visible output) cannot exceed max_tokens.
Budget Sizing Guidelines
- Simple analysis (5,000-8,000): Code review, basic reasoning, straightforward classification with justification.
- Moderate complexity (10,000-20,000): Multi-step math, strategic analysis, architectural decisions, debugging complex issues.
- High complexity (20,000-50,000): Advanced mathematical proofs, multi-factor decision analysis, complex code generation with error handling.
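The sizing and range rules above can be sketched as a small pre-flight helper. The function names and the per-tier defaults here are illustrative choices, not part of the Anthropic SDK:

```python
def validate_thinking_budget(budget_tokens: int, max_tokens: int) -> None:
    """Raise if the budget violates the documented constraints."""
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1,024")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be less than max_tokens")


def suggest_budget(complexity: str) -> int:
    """Map a rough complexity label to a starting budget.

    The tier values are illustrative midpoints of the guideline ranges above.
    """
    return {"simple": 6000, "moderate": 15000, "high": 30000}[complexity]
```

For example, validate_thinking_budget(10000, 16000) passes silently, while validate_thinking_budget(10000, 8000) raises before you waste an API call on a request the server would reject.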
Keep the constraints in mind: (1) budget_tokens must be at least 1,024, (2) budget_tokens must be less than max_tokens, (3) Extended Thinking is NOT compatible with prefilling (you cannot include an assistant message at the end of the messages array), (4) temperature must be set to 1 when Extended Thinking is enabled (you cannot lower it), and (5) thinking blocks are not guaranteed to appear in every response.
When to Use Extended Thinking
Good Candidates for Extended Thinking
- Multi-step mathematical reasoning: Problems that require carrying intermediate results across several steps.
- Complex code generation: Tasks where Claude needs to consider multiple approaches, edge cases, and interactions before committing to an implementation.
- Strategic analysis: Business decisions with multiple factors, trade-offs, and stakeholder perspectives.
- Debugging: Tracing through code logic to identify the root cause of a subtle bug.
- Architectural planning: Designing system architectures where components interact in complex ways.
Poor Candidates for Extended Thinking
- Simple extraction or classification: Tasks where the answer is directly in the input text. Extended Thinking adds latency and cost without improving accuracy.
- Text generation: Creative writing, summarization, and translation do not benefit from extended reasoning.
- Structured output formatting: If you just need Claude to convert data into JSON, thinking tokens are wasted.
- High-throughput, low-latency pipelines: Extended Thinking adds significant latency. If you are processing thousands of items and speed matters, skip it.
Extended Thinking in Multi-Turn Conversations
In multi-turn conversations, thinking blocks from previous turns are not sent back to the API. Only the visible text content from assistant messages is included in subsequent requests. This means Claude does not remember its earlier reasoning — only its conclusions.
```python
import anthropic

client = anthropic.Anthropic()

ALGORITHM_QUESTION = (
    "Analyze this algorithm for time complexity.\n\n"
    "def mystery(arr):\n"
    "    n = len(arr)\n"
    "    for i in range(n):\n"
    "        for j in range(i, n):\n"
    "            if arr[j] < arr[i]:\n"
    "                arr[i], arr[j] = arr[j], arr[i]\n"
    "    return arr"
)

# Turn 1: Initial analysis with thinking
response1 = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": ALGORITHM_QUESTION}],
)

# Extract only the text content for the next turn
assistant_text = ""
for block in response1.content:
    if block.type == "text":
        assistant_text = block.text

# Turn 2: Follow-up question
# Note: thinking blocks from turn 1 are NOT included
response2 = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[
        {"role": "user", "content": ALGORITHM_QUESTION},
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": "Can you optimize this to O(n log n)?"},
    ],
)
```

Extended Thinking vs. Chain-of-Thought
This distinction is critical for the exam:
- Chain-of-Thought (CoT): A prompting technique where you ask Claude to show its reasoning in the visible output. The reasoning is part of the text response. No special API parameters needed. Works with all models and configurations. Reasoning and answer share the same max_tokens budget.
- Extended Thinking: An API feature with a dedicated thinking budget. Reasoning happens in a separate thinking block, not in the visible output. Requires specific API parameters. Has constraints (no prefilling, temperature = 1). Thinking gets its own token budget separate from the response.
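To make the contrast concrete, here is what a CoT-style request looks like (the prompt wording and parameter values are illustrative). Note that there is no thinking parameter, so temperature control and prefilling remain available, and the step-by-step reasoning comes out of the same max_tokens budget as the answer:

```python
# Chain-of-thought via prompting alone: no `thinking` parameter involved.
cot_request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 2000,      # reasoning and answer share this budget
    "temperature": 0.2,      # allowed here; forbidden with Extended Thinking
    "messages": [{
        "role": "user",
        "content": (
            "Is 1,000,003 prime? Think through it step by step, "
            "then state your final answer on the last line."
        ),
    }],
}
# Send with: client.messages.create(**cot_request)
```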
Decision Matrix
- Use CoT when: you want visible reasoning, need prefilling, need temperature control, or are processing high-volume tasks where latency matters.
- Use Extended Thinking when: the task genuinely requires deep reasoning, you want a clean response without visible working, and you can accept the latency and cost.
- Use neither when: the task is straightforward extraction, classification, or generation that does not benefit from step-by-step reasoning.
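The matrix above can be sketched as a routing helper. The function, its task labels, and the routing thresholds are hypothetical illustrations, not an Anthropic API:

```python
def choose_reasoning_mode(task: str,
                          needs_prefill: bool = False,
                          latency_sensitive: bool = False) -> str:
    """Return 'extended_thinking', 'cot', or 'none' for a task label (illustrative)."""
    deep = {"math_proof", "debugging", "architecture", "strategic_analysis"}
    simple = {"extraction", "classification", "formatting"}
    if task in simple:
        return "none"  # step-by-step reasoning adds cost without accuracy gains
    if needs_prefill or latency_sensitive:
        return "cot"   # prefilling and tight latency rule out Extended Thinking
    return "extended_thinking" if task in deep else "cot"
```

For example, choose_reasoning_mode("debugging") routes to Extended Thinking, but choose_reasoning_mode("debugging", needs_prefill=True) falls back to CoT because prefilling is incompatible with Extended Thinking.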
Streaming with Extended Thinking
Extended Thinking works with streaming. Thinking tokens stream first, followed by the text response. You can display a “thinking” indicator while thinking tokens arrive and then switch to displaying the response.
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": (
        "Solve this step by step: If a train leaves Chicago at 9am traveling "
        "60mph and another leaves New York at 10am traveling 80mph, when do "
        "they meet? The distance is 790 miles."
    )}],
) as stream:
    current_block_type = None
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                current_block_type = "thinking"
                print("[Thinking...]")
            elif event.content_block.type == "text":
                current_block_type = "text"
                print("\n[Response]")
        elif event.type == "content_block_delta":
            if current_block_type == "text" and hasattr(event.delta, "text"):
                print(event.delta.text, end="", flush=True)
```