Context Window Management
Token limits, prioritization, summarization, and sliding windows.
Learning Objectives
- Understand token limits for each model
- Implement summarization strategies
- Design sliding window approaches
Context Window Management
The context window is the single most important constraint in any Claude-based system. Every token you send — system prompts, user messages, tool definitions, tool results, conversation history, and the model's own responses — competes for space inside a fixed-size window. Effective context window management is the difference between a system that works reliably in production and one that silently degrades as conversations grow longer.
Understanding Token Limits
Claude models have specific context window sizes that define the maximum number of tokens that can be processed in a single API call. This includes both the input (everything you send) and the output (what the model generates).
- Claude 3.5 Sonnet / Claude 3 Opus: 200,000 token context window
- Claude 3 Haiku: 200,000 token context window
- Max output tokens: varies by model, typically 4,096 to 8,192 tokens by default, configurable up to the model maximum with the max_tokens parameter
A common misconception is that 200K tokens means you have 200K tokens of "space" for your content. In reality, the context window is shared between input and output. If your input consumes 195K tokens, you have at most 5K tokens left for the model's response — and the model may refuse to generate a useful answer or truncate its output.
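A quick arithmetic check makes the shared-budget point concrete. This is a minimal sketch; the constant and function names are illustrative, not part of any SDK:

```python
# The context window is shared: input tokens plus the requested
# max_tokens for output must both fit inside the model's window.
CONTEXT_WINDOW = 200_000  # Claude 3.x-family models

def fits_in_context(input_tokens: int, max_tokens: int) -> bool:
    """Return True if the request leaves room for the requested output."""
    return input_tokens + max_tokens <= CONTEXT_WINDOW

# 195K of input leaves at most 5K for the response
print(fits_in_context(195_000, 4_096))  # True:  199,096 <= 200,000
print(fits_in_context(195_000, 8_192))  # False: 203,192 >  200,000
```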
Token Counting and Budget Planning
Before designing your system, you need to understand how tokens are allocated. Here is a practical approach to token budgeting:
import anthropic

client = anthropic.Anthropic()

# Count tokens for a message before sending it
def count_message_tokens(messages, system_prompt="", tools=None):
    """Estimate token usage before making an API call."""
    count_params = {
        "model": "claude-sonnet-4-20250514",
        "messages": messages,
    }
    if system_prompt:
        count_params["system"] = system_prompt
    if tools:
        count_params["tools"] = tools
    response = client.messages.count_tokens(**count_params)
    return response.input_tokens
# Example: Budget planning for a conversation
system_prompt = "You are a helpful assistant for customer support."
messages = [
    {"role": "user", "content": "What is your return policy?"},
    {"role": "assistant", "content": "Our return policy allows returns within 30 days..."},
    {"role": "user", "content": "Can I return an opened item?"},
]

token_count = count_message_tokens(messages, system_prompt)
print(f"Current input tokens: {token_count}")

# Calculate remaining budget
MAX_CONTEXT = 200000
RESERVED_FOR_OUTPUT = 4096
available_for_input = MAX_CONTEXT - RESERVED_FOR_OUTPUT
remaining = available_for_input - token_count
print(f"Remaining input budget: {remaining} tokens")

Context Window Anatomy
Understanding what occupies your context window is essential for managing it. Here is a typical breakdown for an agentic system:
# Context Window Layout (200K tokens total)
# ┌─────────────────────────────────────────────────┐
# │ System Prompt ~500-2000 tokens │
# │ Tool Definitions (10 tools) ~3000-5000 tokens │
# │ Conversation History (growing) ~variable │
# │ - User messages │
# │ - Assistant responses │
# │ - Tool calls and results │
# │ Current User Message ~100-5000 tokens │
# │ ───────────────────────────────────────────────── │
# │ Reserved for Output ~4096 tokens │
# └─────────────────────────────────────────────────┘

Tool definitions are a frequently overlooked source of token consumption. Each tool definition — including its name, description, and full JSON schema — can consume hundreds of tokens. Ten tools with detailed schemas can easily consume 3,000-5,000 tokens before a single message is exchanged.
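One way to see tool-definition overhead concretely is a rough offline estimate using the common ~4-characters-per-token heuristic; for exact numbers, call the token counting API once with and once without the tools parameter and compare the two counts. The helper below is a hypothetical sketch, not an SDK function:

```python
import json

def estimate_tool_tokens(tools):
    """Rough approximation of tokens consumed by tool definitions,
    using the ~4 characters-per-token heuristic on the JSON schema."""
    return sum(len(json.dumps(tool)) // 4 for tool in tools)

# A typical small tool definition in the Messages API format
example_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Even ten modest tools add up before any conversation begins
print(estimate_tool_tokens([example_tool] * 10))
```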
Token Prioritization Strategies
When the context window fills up, you must decide what to keep and what to remove. The priority order should generally be:
- Highest priority — System prompt: The model's instructions and persona. Without these, behavior becomes unpredictable.
- High priority — Tool definitions: Required for the model to call tools. Only include tools relevant to the current task.
- High priority — Recent conversation turns: The last 2-4 exchanges provide immediate context for the current question.
- Medium priority — Key earlier context: Important facts, decisions, or user preferences established earlier in the conversation.
- Low priority — Verbose tool results: Full API responses, large documents, or detailed search results that have already been processed.
- Lowest priority — Redundant or stale information: Repeated greetings, superseded instructions, or corrected mistakes.
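The priority order above can be sketched as a pruning routine: when the context is over budget, drop the lowest-priority items first and never touch the system prompt. Everything here (the item structure, the numeric priorities, the helper name) is hypothetical, chosen only to illustrate the dropping order:

```python
def prune_by_priority(items, token_budget):
    """Drop lowest-priority context items until under budget.

    items: list of dicts with 'priority' (0 = most important),
    'tokens' (estimated count), and 'content'.
    """
    total = sum(item["tokens"] for item in items)
    # Walk candidates from lowest priority (highest number) upward
    for item in sorted(items, key=lambda i: -i["priority"]):
        if total <= token_budget:
            break
        if item["priority"] == 0:  # never drop the system prompt
            continue
        items.remove(item)
        total -= item["tokens"]
    return items

context = [
    {"priority": 0, "tokens": 800,  "content": "system prompt"},
    {"priority": 1, "tokens": 3000, "content": "tool definitions"},
    {"priority": 2, "tokens": 1200, "content": "recent turns"},
    {"priority": 4, "tokens": 9000, "content": "verbose tool result"},
    {"priority": 5, "tokens": 500,  "content": "stale greeting"},
]
kept = prune_by_priority(context, token_budget=6000)
print([item["content"] for item in kept])
# → ['system prompt', 'tool definitions', 'recent turns']
```

Note that the verbose tool result is dropped even though the stale greeting alone would not free enough space: pruning continues until the budget is met or only protected items remain.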
Conversation Summarization
As conversations grow, summarization is the most powerful technique for managing context. The idea is simple: periodically replace older conversation turns with a concise summary that preserves the essential information.
import anthropic

client = anthropic.Anthropic()

def summarize_conversation(messages, keep_recent=4):
    """
    Summarize older messages while keeping recent turns intact.

    Args:
        messages: Full conversation history
        keep_recent: Number of recent message pairs to preserve verbatim

    Returns:
        Condensed message list with summary + recent messages
    """
    if len(messages) <= keep_recent * 2:
        return messages  # Not enough history to warrant summarization

    # Split into old messages (to summarize) and recent (to keep)
    old_messages = messages[:-(keep_recent * 2)]
    recent_messages = messages[-(keep_recent * 2):]

    # Build a summary of the older conversation
    summary_prompt = (
        "Summarize the following conversation history. Preserve:\n"
        "1. Key facts and decisions made\n"
        "2. User preferences and constraints mentioned\n"
        "3. Any unresolved questions or pending tasks\n"
        "4. Important context needed for future turns\n\n"
        "Be concise but do not lose critical information.\n\n"
        "Conversation to summarize:\n"
    )
    for msg in old_messages:
        summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n"

    summary_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    summary_text = summary_response.content[0].text

    # Reconstruct the conversation with the summary as the first message
    condensed = [
        {
            "role": "user",
            "content": f"[Conversation Summary] {summary_text}",
        },
        {
            "role": "assistant",
            "content": "Understood. I have the context from our previous conversation.",
        },
    ] + recent_messages
    return condensed
# Usage in an agent loop
conversation = []
MAX_INPUT_TOKENS = 150000  # Leave room for output and tools

def agent_turn(user_message, system_prompt, tools):
    conversation.append({"role": "user", "content": user_message})

    # Check if we need to summarize
    token_count = count_message_tokens(conversation, system_prompt, tools)
    if token_count > MAX_INPUT_TOKENS:
        print("Context limit approaching — summarizing older history...")
        condensed = summarize_conversation(conversation)
        conversation.clear()
        conversation.extend(condensed)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system_prompt,
        tools=tools,
        messages=conversation,
    )
    # Simplified: assumes the first content block is text. A real agent
    # loop must also handle tool_use blocks in the response.
    assistant_message = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message

Sliding Window Approach
A sliding window is a simpler alternative to summarization. Instead of summarizing older turns, you simply drop them. This is effective when earlier conversation context is genuinely not needed, such as in stateless Q&A systems.
def sliding_window(messages, max_turns=20):
    """
    Keep only the most recent N turns of conversation.
    Always preserves a leading summary message if present.
    """
    if len(messages) <= max_turns * 2:
        return messages

    # Check if the first message is a context/summary message
    has_context_prefix = (
        messages[0]["role"] == "user"
        and messages[0]["content"].startswith("[Conversation Summary]")
    )
    if has_context_prefix:
        # Keep summary message + assistant ack + recent turns
        return messages[:2] + messages[-(max_turns * 2):]
    else:
        # Just keep recent turns
        return messages[-(max_turns * 2):]
def hybrid_window(messages, token_budget, system_prompt, tools):
    """
    Adaptive sliding window based on actual token count.
    Drops oldest messages until under budget.
    """
    while len(messages) > 2:
        tokens = count_message_tokens(messages, system_prompt, tools)
        if tokens <= token_budget:
            break
        # Remove the oldest user-assistant pair
        messages = messages[2:]
    return messages

Dynamic Tool Selection
Another often-overlooked strategy is to dynamically include only the tools relevant to the current phase of a conversation. Rather than sending all 20 tools on every API call, select a subset based on the task at hand.
# Instead of sending all tools every time:
all_tools = [search_tool, calendar_tool, email_tool, database_tool,
             file_tool, calculator_tool, weather_tool, translate_tool]

# Dynamically select relevant tools based on context
def select_tools(user_message, conversation_phase):
    """Select only relevant tools to reduce token usage."""
    base_tools = [search_tool]  # Always available
    if conversation_phase == "scheduling":
        return base_tools + [calendar_tool, email_tool]
    elif conversation_phase == "data_analysis":
        return base_tools + [database_tool, calculator_tool]
    elif conversation_phase == "file_management":
        return base_tools + [file_tool]
    else:
        # For general queries, use a smaller default set
        return base_tools

# This can save 2000-4000 tokens per API call

Exam Tip: The exam frequently tests your understanding of what consumes tokens in the context window. Remember that tool definitions, system prompts, and tool results all count toward the input token limit. A common trick question asks about a system that "works in testing" (short conversations) but "fails in production" (long conversations) — the answer is almost always context window exhaustion. Know that the fix involves summarization, sliding windows, or dynamic tool selection — not simply increasing max_tokens (which only controls output length).
Exam Tip: When asked about the tradeoff between summarization and sliding windows, remember: summarization preserves information but costs an extra API call (latency and money). Sliding windows are free but lose information permanently. The best approach depends on whether the older context is likely to be needed again.
Key Takeaways
The context window is a shared, finite resource. Input tokens (system prompt + tools + history + current message) and output tokens all compete for space within the model's maximum context size.
Token budgeting is essential for production systems. Use the token counting API to monitor usage and trigger context management strategies before hitting limits.
Summarization preserves information; sliding windows trade information for simplicity. Choose based on whether older context matters for your use case.
Dynamic tool selection reduces per-call token usage by only including tools relevant to the current task phase, freeing context space for actual conversation content.