Prompt Caching Strategies
Reducing latency and cost with prompt caching.
Learning Objectives
- Understand how prompt caching works
- Identify cacheable prompt components
- Optimize cache hit rates for cost savings
Prompt Caching Strategies
Prompt caching is a performance and cost optimization feature that allows the Anthropic API to reuse previously processed prompt prefixes across API calls. When enabled, Claude does not need to re-process the cached portion of your prompt on subsequent requests, resulting in lower latency and reduced costs. Understanding how caching works and how to structure prompts to maximize cache hits is essential for production systems.
How Prompt Caching Works
When you send a request with caching enabled, Anthropic's infrastructure checks whether the beginning of your prompt (the "prefix") matches a previously cached prompt. If it does, the cached computation is reused, and you are charged at a reduced rate for the cached tokens.
- Cache write: The first request processes the full prompt and stores the prefix in cache. You pay a small premium (25% more than base input price) for the write.
- Cache hit: Subsequent requests that share the same prefix reuse the cached computation. Cached tokens cost 90% less than base input price.
- Cache lifetime: Cached prefixes have a time-to-live (TTL) of at least 5 minutes, refreshed on each hit. High-traffic prompts stay cached longer.
- Minimum cacheable length: The cached prefix must be at least 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, or 2,048 tokens for Claude 3 Haiku.
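The minimum-length rule above can be turned into a quick preflight check before you add a cache marker. This is a sketch under stated assumptions: the ~4 characters-per-token ratio is a rough heuristic (not a real tokenizer), and the model-family keys are illustrative labels; authoritative token counts come from the `usage` fields in the API response.

```python
# Rough preflight check: is this prompt component long enough to cache?
# Assumption: ~4 characters per token, a common rough heuristic.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-sonnet": 1024,
    "claude-3-opus": 1024,
    "claude-3-haiku": 2048,
}

def is_likely_cacheable(text: str, model_family: str) -> bool:
    """Return True if text probably exceeds the model's minimum
    cacheable prefix length (estimate only; verify via usage fields)."""
    estimated_tokens = len(text) / 4
    return estimated_tokens >= MIN_CACHEABLE_TOKENS[model_family]

# A ~200-token (~800-character) system prompt falls below every minimum,
# so a cache_control marker on it would simply be ignored.
print(is_likely_cacheable("x" * 800, "claude-3-5-sonnet"))   # False
print(is_likely_cacheable("x" * 8000, "claude-3-5-sonnet"))  # True
```

Note that a marker on a below-minimum prefix does not cause an error; the content is simply processed without caching, which is why verifying against the response's usage fields matters.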
The Prefix Matching Rule
Caching works on a prefix basis — the cached portion must start from the very beginning of the prompt and extend contiguously. This means the order of content in your prompt is critical for cache efficiency.
# GOOD: Static content first, dynamic content last
# The system prompt and tool definitions are identical across requests,
# so they form a cacheable prefix.
#
# ┌────────────────────────────┐
# │ System Prompt (cached) │ ← Same every request
# │ Tool Definitions (cached) │ ← Same every request
# │ Few-shot Examples (cached) │ ← Same every request
# │ ─── cache boundary ─── │
# │ Conversation History │ ← Changes each request
# │ Current User Message │ ← Changes each request
# └────────────────────────────┘
#
# BAD: Dynamic content mixed into the prefix
# ┌────────────────────────────┐
# │ System Prompt │
# │ Current Date/Time │ ← Changes every request! Breaks cache.
# │ Tool Definitions │
# │ Conversation History │
# └────────────────────────────┘

Implementing Prompt Caching
You enable caching by adding cache_control markers to your prompt content blocks. The marker tells the API: "cache everything up to and including this block."
import anthropic
client = anthropic.Anthropic()
# Example 1: Caching a large system prompt
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=[
{
"type": "text",
"text": (
"You are an expert legal analyst. You specialize in contract "
"review, regulatory compliance, and corporate governance. "
"Follow these detailed guidelines for analysis...\n\n"
# Imagine this is a very long, detailed system prompt
# with analysis frameworks, output formats, etc.
"... (2000+ tokens of detailed instructions) ..."
),
"cache_control": {"type": "ephemeral"},
}
],
messages=[
{"role": "user", "content": "Review section 3.2 of this contract..."},
],
)
# Check cache performance in the response
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

Caching Tool Definitions
Tool definitions are one of the best candidates for caching because they are typically identical across all requests in a session and can consume thousands of tokens.
import anthropic
client = anthropic.Anthropic()
# Define tools with cache_control on the last tool
tools = [
{
"name": "search_database",
"description": "Search the product database with filters.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"category": {"type": "string", "description": "Product category"},
"max_results": {"type": "integer", "description": "Max results"},
},
"required": ["query"],
},
},
{
"name": "get_product_details",
"description": "Get full details for a specific product by ID.",
"input_schema": {
"type": "object",
"properties": {
"product_id": {"type": "string", "description": "Product ID"},
},
"required": ["product_id"],
},
},
{
"name": "check_inventory",
"description": "Check current inventory levels for a product.",
"input_schema": {
"type": "object",
"properties": {
"product_id": {"type": "string", "description": "Product ID"},
"warehouse": {"type": "string", "description": "Warehouse code"},
},
"required": ["product_id"],
},
# Place cache_control on the LAST tool to cache all tool definitions
"cache_control": {"type": "ephemeral"},
},
]
# All subsequent requests reuse the cached tool definitions
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
tools=tools,
messages=[{"role": "user", "content": "Find me red shoes in stock"}],
)

Caching Few-Shot Examples and Reference Documents
If your prompt includes few-shot examples or large reference documents, these are excellent caching candidates because they remain constant across requests.
import anthropic
client = anthropic.Anthropic()
# Cache a large reference document + few-shot examples
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2048,
system=[
{
"type": "text",
"text": "You are a medical coding assistant.",
},
],
messages=[
# First message: reference material (cacheable)
{
"role": "user",
"content": [
{
"type": "text",
"text": (
"Here is the ICD-10 coding reference guide:\n\n"
"... (large reference document, 5000+ tokens) ...\n\n"
"And here are examples of correct coding:\n\n"
"Example 1: Patient presents with acute bronchitis "
"-> J20.9\n"
"Example 2: Type 2 diabetes with neuropathy "
"-> E11.40\n"
"... (more examples) ..."
),
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "assistant",
"content": "I have reviewed the ICD-10 reference guide and examples. Ready to help with medical coding.",
},
# Subsequent messages: the actual query (changes per request)
{
"role": "user",
"content": "Code this: Patient with chronic lower back pain and sciatica",
},
],
)

Multi-Turn Conversation Caching
In multi-turn conversations, you can incrementally cache the growing conversation history. Each new turn extends the cached prefix.
import anthropic
client = anthropic.Anthropic()
def chat_with_caching(conversation_history, new_user_message):
"""
Send a message with incremental conversation caching.
The idea: mark the last message in the existing history
with cache_control so the entire prefix is cached.
On the next turn, the prefix (including this turn) is reused.
"""
# Add cache_control to the last message in history
messages = []
for i, msg in enumerate(conversation_history):
if i == len(conversation_history) - 1:
# Mark the last historical message for caching
messages.append({
"role": msg["role"],
"content": [
{
"type": "text",
"text": msg["content"],
"cache_control": {"type": "ephemeral"},
}
],
})
else:
messages.append(msg)
# Add the new user message (not cached yet)
messages.append({"role": "user", "content": new_user_message})
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=messages,
)
# Log cache performance
usage = response.usage
cached = getattr(usage, "cache_read_input_tokens", 0)
total_input = usage.input_tokens
if total_input > 0:
cache_rate = (cached / total_input) * 100
print(f"Cache hit rate: {cache_rate:.1f}%")
    return response.content[0].text

Cost Savings Calculation
Understanding the economics of prompt caching is important for justifying its use and for exam questions about cost optimization.
# Prompt Caching Cost Model
#
# Base input cost: $3.00 per million tokens (example rate)
# Cache write cost: $3.75 per million tokens (25% premium)
# Cache read cost: $0.30 per million tokens (90% discount)
#
# Example: System prompt + tools = 5,000 tokens, 100 requests/hour
#
# WITHOUT caching:
# 5,000 tokens x 100 requests = 500,000 tokens/hour
# Cost: 500,000 / 1,000,000 x $3.00 = $1.50/hour
#
# WITH caching:
# First request (cache write): 5,000 tokens x $3.75/M = $0.01875
# Remaining 99 requests (cache read): 5,000 x 99 = 495,000 tokens
# Cost: 495,000 / 1,000,000 x $0.30 = $0.1485
# Total: $0.01875 + $0.1485 = $0.167/hour
#
# Savings: $1.50 - $0.167 = $1.333/hour (89% reduction)
def estimate_cache_savings(
cached_tokens,
requests_per_hour,
base_cost_per_million=3.00,
):
"""Estimate hourly cost savings from prompt caching."""
write_premium = 1.25 # 25% more than base
read_discount = 0.10 # 90% less than base
# Without caching
total_tokens = cached_tokens * requests_per_hour
cost_without = (total_tokens / 1_000_000) * base_cost_per_million
# With caching
write_cost = (cached_tokens / 1_000_000) * base_cost_per_million * write_premium
read_tokens = cached_tokens * (requests_per_hour - 1)
read_cost = (read_tokens / 1_000_000) * base_cost_per_million * read_discount
cost_with = write_cost + read_cost
savings = cost_without - cost_with
savings_pct = (savings / cost_without) * 100 if cost_without > 0 else 0
print(f"Without caching: ${cost_without:.4f}/hour")
print(f"With caching: ${cost_with:.4f}/hour")
print(f"Savings: ${savings:.4f}/hour ({savings_pct:.1f}%)")
    return savings

Exam Tip: Cache matching is prefix-based and exact. Even a single character difference in the prefix (such as including a timestamp or request ID) will break the cache. The exam tests whether you understand that static content must come first and dynamic content must come after the cache boundary. A common wrong answer places user-specific or time-varying content inside the cached prefix.
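One way to apply this tip in practice is to split the system prompt into two content blocks: a static block that carries the cache marker, and a dynamic block (here, a date stamp) placed after the boundary. This is a minimal sketch; STATIC_GUIDELINES is a placeholder for a long, unchanging system prompt, and the date format is an illustrative choice.

```python
from datetime import datetime, timezone

# Placeholder for a long, static system prompt (imagine 2000+ tokens).
STATIC_GUIDELINES = "You are an expert analyst. ... (long static instructions) ..."

def build_system_blocks():
    """Static block first (cached), time-varying block after the boundary."""
    return [
        {
            "type": "text",
            "text": STATIC_GUIDELINES,
            # Cache boundary: everything up to and including this block
            # is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        },
        {
            # The date changes every day, so it lives AFTER the boundary
            # and never invalidates the cached prefix above it.
            "type": "text",
            "text": f"Current date: {datetime.now(timezone.utc).date().isoformat()}",
        },
    ]

blocks = build_system_blocks()
print("cache_control" in blocks[0])  # True: static block is marked
print("cache_control" in blocks[1])  # False: dynamic block is not
```

The same pattern covers request IDs, user names, or any other per-request value: keep it out of the marked prefix and the cache stays intact.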
Exam Tip: Know the minimum cacheable token counts: 1,024 tokens for Sonnet/Opus and 2,048 tokens for Haiku. If a question asks about caching a 200-token system prompt, the correct answer is that it will not be cached because it falls below the minimum threshold.
Exam Tip: Place the cache_control marker on the last element you want included in the cache — typically the last tool definition, the last few-shot example, or the last message in the existing conversation history. Everything before and including that marker is the cached prefix.
Key Takeaways
- Prompt caching reuses previously computed prefixes to reduce latency and cost. Cache writes cost 25% more than base input; cache reads cost 90% less.
- Caching is prefix-based and exact. Structure prompts with static content (system prompts, tools, examples) first and dynamic content (user messages) last to maximize cache hits.
- The best caching candidates are large, static prompt components: detailed system prompts, tool definitions, few-shot examples, and reference documents.
- Monitor cache performance using cache_creation_input_tokens and cache_read_input_tokens in the API response to verify your caching strategy is working correctly.