🛡️ Context & Reliability · Lesson 5.1

Context Window Management

Token limits, prioritization, summarization, and sliding windows.

25 min

Learning Objectives

  • Understand token limits for each model
  • Implement summarization strategies
  • Design sliding window approaches

Context Window Management

The context window is the single most important constraint in any Claude-based system. Every token you send — system prompts, user messages, tool definitions, tool results, conversation history, and the model's own responses — competes for space inside a fixed-size window. Effective context window management is the difference between a system that works reliably in production and one that silently degrades as conversations grow longer.

Understanding Token Limits

Claude models have specific context window sizes that define the maximum number of tokens that can be processed in a single API call. This includes both the input (everything you send) and the output (what the model generates).

  • Claude 3.5 Sonnet / Claude 3 Opus: 200,000 token context window
  • Claude 3 Haiku: 200,000 token context window
  • Max output tokens: varies by model, typically 4,096 to 8,192 by default, and configurable up to the model's maximum via the max_tokens parameter

A common misconception is that 200K tokens means you have 200K tokens of "space" for your content. In reality, the context window is shared between input and output. If your input consumes 195K tokens, at most 5K tokens remain for the model's response — and a long answer will simply be cut off when that remaining budget runs out.

Token Counting and Budget Planning

Before designing your system, you need to understand how tokens are allocated. Here is a practical approach to token budgeting:

import anthropic

client = anthropic.Anthropic()

# Count tokens for a message before sending it
def count_message_tokens(messages, system_prompt="", tools=None):
    """Estimate token usage before making an API call."""
    count_params = {
        "model": "claude-sonnet-4-20250514",
        "messages": messages,
    }
    if system_prompt:
        count_params["system"] = system_prompt
    if tools:
        count_params["tools"] = tools

    response = client.messages.count_tokens(**count_params)
    return response.input_tokens


# Example: Budget planning for a conversation
system_prompt = "You are a helpful assistant for customer support."
messages = [
    {"role": "user", "content": "What is your return policy?"},
    {"role": "assistant", "content": "Our return policy allows returns within 30 days..."},
    {"role": "user", "content": "Can I return an opened item?"},
]

token_count = count_message_tokens(messages, system_prompt)
print(f"Current input tokens: {token_count}")

# Calculate remaining budget
MAX_CONTEXT = 200000
RESERVED_FOR_OUTPUT = 4096
available_for_input = MAX_CONTEXT - RESERVED_FOR_OUTPUT
remaining = available_for_input - token_count
print(f"Remaining input budget: {remaining} tokens")

Context Window Anatomy

Understanding what occupies your context window is essential for managing it. Here is a typical breakdown for an agentic system:

# Context Window Layout (200K tokens total)
# ┌──────────────────────────────────────────────────────┐
# │ System Prompt                      ~500-2,000 tokens │
# │ Tool Definitions (10 tools)      ~3,000-5,000 tokens │
# │ Conversation History (growing)              variable │
# │   - User messages                                    │
# │   - Assistant responses                              │
# │   - Tool calls and results                           │
# │ Current User Message               ~100-5,000 tokens │
# │──────────────────────────────────────────────────────│
# │ Reserved for Output                    ~4,096 tokens │
# └──────────────────────────────────────────────────────┘

Tool definitions are a frequently overlooked source of token consumption. Each tool definition — including its name, description, and full JSON schema — can consume hundreds of tokens. Ten tools with detailed schemas can easily consume 3,000-5,000 tokens before a single message is exchanged.
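The count_tokens API shown earlier gives the authoritative number, since the exact cost depends on the tokenizer. For quick offline budgeting, a rough heuristic of ~4 characters per token works well enough for English JSON. The get_weather tool below is a hypothetical example, not part of any SDK:

```python
import json

# Hypothetical tool definition in the Anthropic tool schema shape
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city, including temperature, "
                   "conditions, and a short forecast.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def estimate_tool_tokens(tools, chars_per_token=4):
    """Rough estimate: English JSON averages ~4 characters per token."""
    return len(json.dumps(tools)) // chars_per_token

print(f"~{estimate_tool_tokens([get_weather_tool])} tokens for one small tool")
```

Even this small, single-parameter tool costs roughly a hundred tokens; detailed production schemas are often several times larger, which is how ten tools reach the 3,000-5,000 token range.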

Token Prioritization Strategies

When the context window fills up, you must decide what to keep and what to remove. The priority order should generally be:

  • Highest priority — System prompt: The model's instructions and persona. Without these, behavior becomes unpredictable.
  • High priority — Tool definitions: Required for the model to call tools. Only include tools relevant to the current task.
  • High priority — Recent conversation turns: The last 2-4 exchanges provide immediate context for the current question.
  • Medium priority — Key earlier context: Important facts, decisions, or user preferences established earlier in the conversation.
  • Low priority — Verbose tool results: Full API responses, large documents, or detailed search results that have already been processed.
  • Lowest priority — Redundant or stale information: Repeated greetings, superseded instructions, or corrected mistakes.
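The priority order above can be sketched as a pruning routine. This is an illustrative helper (not part of the Anthropic SDK): the numeric priorities mirror the list above, and token counts use the rough 4-characters-per-token heuristic rather than a real tokenizer:

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate (~4 characters per token for English text)."""
    return len(text) // chars_per_token

def prune_by_priority(entries, budget):
    """
    Drop the lowest-priority entries until the estimated total fits the budget.
    Each entry is (priority, text); lower numbers mean higher priority.
    Returns surviving entries in their original order.
    """
    # Indices sorted so the lowest-priority entry comes last
    order = sorted(range(len(entries)), key=lambda i: entries[i][0])
    kept = set(order)
    while kept and sum(estimate_tokens(entries[i][1]) for i in kept) > budget:
        kept.discard(order.pop())  # drop the current lowest-priority entry
    return [e for i, e in enumerate(entries) if i in kept]

entries = [
    (1, "SYSTEM: You are a support assistant."),                # highest priority
    (3, "USER: Can I return an opened item?"),                  # recent turn
    (5, "TOOL RESULT: full 30-page policy document ..." * 40),  # verbose, stale
]
pruned = prune_by_priority(entries, budget=50)
# The verbose tool result is dropped first; the system prompt survives.
```

In a real system you would cut inside entries as well (e.g. truncating a tool result rather than dropping it whole), but the principle is the same: evict from the bottom of the priority list first.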

Conversation Summarization

As conversations grow, summarization is the most powerful technique for managing context. The idea is simple: periodically replace older conversation turns with a concise summary that preserves the essential information.

import anthropic

client = anthropic.Anthropic()

def summarize_conversation(messages, keep_recent=4):
    """
    Summarize older messages while keeping recent turns intact.

    Args:
        messages: Full conversation history
        keep_recent: Number of recent message pairs to preserve verbatim
    Returns:
        Condensed message list with summary + recent messages
    """
    if len(messages) <= keep_recent * 2:
        return messages  # Not enough history to warrant summarization

    # Split into old messages (to summarize) and recent (to keep)
    old_messages = messages[:-(keep_recent * 2)]
    recent_messages = messages[-(keep_recent * 2):]

    # Build a summary of the older conversation
    summary_prompt = (
        "Summarize the following conversation history. Preserve:\n"
        "1. Key facts and decisions made\n"
        "2. User preferences and constraints mentioned\n"
        "3. Any unresolved questions or pending tasks\n"
        "4. Important context needed for future turns\n\n"
        "Be concise but do not lose critical information.\n\n"
        "Conversation to summarize:\n"
    )
    for msg in old_messages:
        summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n"

    summary_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": summary_prompt}],
    )

    summary_text = summary_response.content[0].text

    # Reconstruct the conversation with the summary as the first message
    condensed = [
        {
            "role": "user",
            "content": f"[Conversation Summary] {summary_text}",
        },
        {
            "role": "assistant",
            "content": "Understood. I have the context from our previous conversation.",
        },
    ] + recent_messages

    return condensed


# Usage in an agent loop
conversation = []
MAX_INPUT_TOKENS = 150000  # Leave room for output and tools

def agent_turn(user_message, system_prompt, tools):
    conversation.append({"role": "user", "content": user_message})

    # Check if we need to summarize
    token_count = count_message_tokens(conversation, system_prompt, tools)
    if token_count > MAX_INPUT_TOKENS:
        print("Context limit approaching — summarizing older history...")
        condensed = summarize_conversation(conversation)
        conversation.clear()
        conversation.extend(condensed)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system_prompt,
        tools=tools,
        messages=conversation,
    )

    # Note: if the model requested a tool, content[0] will be a tool_use
    # block rather than text; a full agent loop must handle that case.
    assistant_message = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message

Sliding Window Approach

A sliding window is a simpler alternative to summarization. Instead of summarizing older turns, you simply drop them. This is effective when earlier conversation context is genuinely not needed, such as in stateless Q&A systems.

def sliding_window(messages, max_turns=20):
    """
    Keep only the most recent N turns of conversation.
    Always preserves the system context message if present.
    """
    if len(messages) <= max_turns * 2:
        return messages

    # Check if the first message is a context/summary message
    has_context_prefix = (
        messages[0]["role"] == "user"
        and messages[0]["content"].startswith("[Conversation Summary]")
    )

    if has_context_prefix:
        # Keep the summary pair plus the most recent turns, excluding the
        # summary pair from the recent slice so it is never duplicated
        return messages[:2] + messages[2:][-(max_turns * 2 - 2):]
    else:
        # Just keep recent turns
        return messages[-(max_turns * 2):]


def hybrid_window(messages, token_budget, system_prompt, tools):
    """
    Adaptive sliding window based on actual token count.
    Drops oldest messages until under budget.
    """
    while len(messages) > 2:
        tokens = count_message_tokens(messages, system_prompt, tools)
        if tokens <= token_budget:
            break
        # Remove the oldest user-assistant pair (note: this would also drop
        # a leading summary message, so summarize after windowing, not before)
        messages = messages[2:]
    return messages

Dynamic Tool Selection

Another often-overlooked strategy is to dynamically include only the tools relevant to the current phase of a conversation. Rather than sending all 20 tools on every API call, select a subset based on the task at hand.

# Instead of sending all tools every time:
all_tools = [search_tool, calendar_tool, email_tool, database_tool,
             file_tool, calculator_tool, weather_tool, translate_tool]

# Dynamically select relevant tools based on context
def select_tools(user_message, conversation_phase):
    """Select only relevant tools to reduce token usage."""
    base_tools = [search_tool]  # Always available

    if conversation_phase == "scheduling":
        return base_tools + [calendar_tool, email_tool]
    elif conversation_phase == "data_analysis":
        return base_tools + [database_tool, calculator_tool]
    elif conversation_phase == "file_management":
        return base_tools + [file_tool]
    else:
        # For general queries, use a smaller default set
        return base_tools

# This can save 2000-4000 tokens per API call
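One open question in the sketch above is where conversation_phase comes from. A minimal option is a keyword heuristic on the latest user message — the keywords below are illustrative assumptions, and a production system might instead have a cheap, fast model classify the phase:

```python
def detect_phase(user_message):
    """Naive keyword heuristic for classifying the conversation phase."""
    text = user_message.lower()
    if any(w in text for w in ("meeting", "schedule", "calendar", "invite")):
        return "scheduling"
    if any(w in text for w in ("query", "report", "average", "database")):
        return "data_analysis"
    if any(w in text for w in ("file", "folder", "upload", "download")):
        return "file_management"
    return "general"

# Combined with the selector above (hypothetical usage):
# tools = select_tools(user_message, detect_phase(user_message))
```

Keyword matching is brittle (substrings like "profile" would match "file"), but it costs zero tokens; upgrade to a model-based classifier only if misrouting actually hurts.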

Exam Tip: The exam frequently tests your understanding of what consumes tokens in the context window. Remember that tool definitions, system prompts, and tool results all count toward the input token limit. A common trick question asks about a system that "works in testing" (short conversations) but "fails in production" (long conversations) — the answer is almost always context window exhaustion. Know that the fix involves summarization, sliding windows, or dynamic tool selection — not simply increasing max_tokens (which only controls output length).

Exam Tip: When asked about the tradeoff between summarization and sliding windows, remember: summarization preserves information but costs an extra API call (latency and money). Sliding windows are free but lose information permanently. The best approach depends on whether the older context is likely to be needed again.

Key Takeaways

The context window is a shared, finite resource. Input tokens (system prompt + tools + history + current message) and output tokens all compete for space within the model's maximum context size.

Token budgeting is essential for production systems. Use the token counting API to monitor usage and trigger context management strategies before hitting limits.

Summarization preserves information; sliding windows trade information for simplicity. Choose based on whether older context matters for your use case.

Dynamic tool selection reduces per-call token usage by only including tools relevant to the current task phase, freeing context space for actual conversation content.