🔧 Tool Design & MCP · Lesson 2.6

Tool Boundary Management

Managing tool complexity and preventing model overload.

15 min

Learning Objectives

  • Identify when too many tools degrade performance
  • Apply strategies for organizing and scoping tool sets
  • Implement dynamic tool selection patterns

Tool Boundary Management

As agentic systems grow in complexity, tool management becomes a critical architectural challenge. When an agent has access to too many tools, performance degrades -- the model spends more tokens reasoning about which tool to use, makes more selection errors, and response latency increases. This lesson covers strategies for organizing, selecting, and managing tools at scale.

The “Too Many Tools” Problem

Research and practical experience show that Claude's tool selection accuracy degrades as the number of available tools increases. This is not merely a matter of context window size -- it is a fundamental challenge of decision-making complexity.

  • Token overhead: Each tool definition consumes tokens in the context window. With 50 tools, tool definitions alone can consume 10,000+ tokens.
  • Selection confusion: With many similar tools, the model may select the wrong one, especially when tool descriptions overlap or are ambiguous.
  • Increased latency: More tools means more reasoning about which tool to use, increasing time-to-first-token.
  • Error amplification: In multi-step agentic loops, tool selection errors compound across iterations. A wrong tool choice early in the loop can send the agent down a completely wrong path.
  • Decreased output quality: The model's attention is split across many tool definitions, which can reduce the quality of its reasoning about the actual task.

Exam Tip: The exam tests awareness of the “too many tools” problem and its solutions. Anthropic's guidance is clear: keep the active tool set as small as possible for any given interaction. The recommended approach is to provide only the tools relevant to the current task context, not every tool the system supports. Performance is best with 5-10 focused tools per API call.
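The token-overhead point can be made concrete with a rough estimate. The sketch below uses the common ~4-characters-per-token heuristic (an approximation only; for exact counts, use the API's token-counting endpoint):

```python
import json

def estimate_tool_tokens(tools: list) -> int:
    """Roughly estimate tokens consumed by tool definitions (~4 chars/token)."""
    return len(json.dumps(tools)) // 4

# A single small tool definition for illustration
sample_tools = [
    {
        "name": "user_lookup",
        "description": "Look up a user by ID or email.",
        "input_schema": {"type": "object", "properties": {"id": {"type": "string"}}},
    }
]
overhead = estimate_tool_tokens(sample_tools)
```

Multiplying a per-tool estimate like this by the tool count shows how quickly 50 definitions approach the 10,000-token range.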

Strategy 1: Tool Categorization and Namespacing

Organize tools into logical categories and use consistent naming conventions. This helps both the model and human developers understand the tool landscape.

# Organize tools by domain using prefixes
tool_categories = {
    "user_management": [
        {"name": "user_lookup", "description": "Look up a user by ID or email..."},
        {"name": "user_create", "description": "Create a new user account..."},
        {"name": "user_update", "description": "Update user profile fields..."},
        {"name": "user_deactivate", "description": "Deactivate a user account..."},
    ],
    "order_management": [
        {"name": "order_search", "description": "Search orders by criteria..."},
        {"name": "order_details", "description": "Get full details for an order..."},
        {"name": "order_update_status", "description": "Update order status..."},
        {"name": "order_refund", "description": "Process a refund for an order..."},
    ],
    "analytics": [
        {"name": "analytics_revenue", "description": "Get revenue metrics..."},
        {"name": "analytics_usage", "description": "Get usage statistics..."},
        {"name": "analytics_funnel", "description": "Get conversion funnel data..."},
    ]
}

Benefits of categorization:

  • Consistent naming prefixes help the model distinguish between domains
  • Categories map naturally to MCP server boundaries
  • Easier to select relevant subsets based on context
  • Simplifies monitoring and metrics per domain
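Because the prefix carries the category, a flat tool list can be grouped back into categories mechanically. A minimal sketch (tool names are illustrative):

```python
from collections import defaultdict

def group_by_prefix(tools: list) -> dict:
    """Group a flat tool list by the naming prefix before the first underscore."""
    groups = defaultdict(list)
    for tool in tools:
        prefix = tool["name"].split("_", 1)[0]
        groups[prefix].append(tool)
    return dict(groups)

flat_tools = [
    {"name": "user_lookup", "description": "..."},
    {"name": "user_create", "description": "..."},
    {"name": "order_search", "description": "..."},
]
grouped = group_by_prefix(flat_tools)
```

This is also a cheap consistency check: a tool whose prefix lands in an unexpected group usually has a naming problem.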

Strategy 2: Dynamic Tool Selection

Instead of providing all tools in every request, dynamically select which tools to include based on the conversation context. This is one of the most effective strategies for managing tool boundaries.

Keyword-Based Tool Filtering

def select_tools_by_context(user_message: str, all_tools: dict) -> list:
    """Select relevant tools based on keywords in the user message."""
    keyword_to_category = {
        "user": "user_management",
        "account": "user_management",
        "profile": "user_management",
        "order": "order_management",
        "purchase": "order_management",
        "refund": "order_management",
        "revenue": "analytics",
        "metrics": "analytics",
        "analytics": "analytics",
        "report": "analytics",
    }

    selected_categories = set()
    message_lower = user_message.lower()

    for keyword, category in keyword_to_category.items():
        if keyword in message_lower:
            selected_categories.add(category)

    # If no specific category matched, include a general-purpose subset
    if not selected_categories:
        selected_categories = {"user_management"}  # Default category

    selected_tools = []
    for category in selected_categories:
        selected_tools.extend(all_tools.get(category, []))

    return selected_tools


# Usage in the agentic loop
user_message = "I need to check on the refund status for order #12345"
tools = select_tools_by_context(user_message, tool_categories)

# assumes `client` is an initialized anthropic.Anthropic() instance
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    tools=tools,  # Only order-related tools included
    messages=[{"role": "user", "content": user_message}]
)

Embedding-Based Tool Selection

For more sophisticated selection, use embeddings to match the user's query against tool descriptions and select the most semantically relevant tools. This handles cases where keywords alone are insufficient.

import numpy as np


class ToolSelector:
    """Select relevant tools using semantic similarity."""

    def __init__(self, all_tools: list, embedding_model):
        self.all_tools = all_tools
        self.model = embedding_model

        # Pre-compute embeddings for all tool descriptions
        self.tool_embeddings = []
        for tool in all_tools:
            text = f"{tool['name']}: {tool['description']}"
            embedding = self.model.embed(text)
            self.tool_embeddings.append(embedding)

    def select(self, query: str, top_k: int = 5) -> list:
        """Select the top-k most relevant tools for a query."""
        query_embedding = self.model.embed(query)

        # Compute cosine similarity
        similarities = []
        for i, tool_emb in enumerate(self.tool_embeddings):
            sim = np.dot(query_embedding, tool_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(tool_emb)
            )
            similarities.append((sim, i))

        # Return top-k tools
        similarities.sort(reverse=True)
        selected = [self.all_tools[idx] for _, idx in similarities[:top_k]]
        return selected


# Usage
selector = ToolSelector(all_tools, embedding_model)
relevant_tools = selector.select("How much revenue did we make last quarter?", top_k=5)

Exam Tip: Dynamic tool selection is a key concept for the exam. Know the two primary approaches: keyword-based filtering (simple, fast, works well for clearly categorized tools) and embedding-based selection (more sophisticated, handles semantic similarity). The exam may present a scenario with 50+ tools and ask you to identify the correct strategy for managing them.

Strategy 3: Two-Stage Tool Selection (Tool Routing)

Use a lightweight first pass to select tool categories, then provide only the tools from the selected categories to the main model. This is sometimes called “tool routing” and is one of the most powerful patterns for large tool sets.

import json

def two_stage_tool_selection(user_message: str, tool_categories: dict) -> list:
    """Use Claude to select relevant tool categories, then provide those tools."""

    # Stage 1: Use a fast model to classify the request
    category_descriptions = {
        name: f"{name}: {tools[0]['description'][:100]}..."
        for name, tools in tool_categories.items()
    }

    classification = client.messages.create(
        model="claude-haiku-4-20250514",  # Fast, cheap model for classification
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Classify this request into one or more categories.\n"
                       f"Categories: {json.dumps(category_descriptions)}\n"
                       f"Request: {user_message}\n"
                       f"Return only the category names as a JSON array."
        }]
    )

    categories = json.loads(classification.content[0].text)

    # Stage 2: Collect tools from selected categories
    selected_tools = []
    for cat in categories:
        if cat in tool_categories:
            selected_tools.extend(tool_categories[cat])

    return selected_tools

Two-stage selection is effective because:

  • The classification model (e.g., Haiku) is fast and cheap
  • It can handle ambiguous queries better than keyword matching
  • The main model only sees a focused, relevant tool set
  • It scales well as the total number of tools grows
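One caveat with the code above: calling `json.loads` directly on raw model output is fragile. A defensive parse (a sketch; `parse_categories` is a hypothetical helper) falls back to the full category set so a malformed reply never hides the right tool:

```python
import json

def parse_categories(reply_text: str, known_categories: set) -> set:
    """Parse the classifier's reply; fall back to all categories on failure."""
    try:
        parsed = json.loads(reply_text)
        categories = {c for c in parsed if c in known_categories}
        if categories:
            return categories
    except (json.JSONDecodeError, TypeError):
        pass
    # Graceful degradation: the main model sees more tools than ideal,
    # but never loses access to the one it needs.
    return set(known_categories)

known = {"user_management", "order_management", "analytics"}
```

Falling back to everything trades some token overhead for correctness, which is usually the right default in a routing layer.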

Strategy 4: Tool Consolidation

Sometimes the best approach is to reduce the total number of tools by consolidating related operations into fewer, more flexible tools. This must be balanced against the scoping principles from Lesson 2.2 -- consolidate where it reduces confusion, not where it creates “god tools.”

# Before: 4 separate tools that could confuse the model
# - search_orders_by_date
# - search_orders_by_customer
# - search_orders_by_status
# - search_orders_by_amount

# After: 1 consolidated tool with optional filters
{
    "name": "search_orders",
    "description": "Search orders with optional filters. Provide one or more filter criteria to narrow results. Returns matching orders with ID, date, customer, status, and total amount.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Filter by customer ID"
            },
            "status": {
                "type": "string",
                "enum": ["pending", "processing", "shipped", "delivered", "cancelled"],
                "description": "Filter by order status"
            },
            "date_from": {
                "type": "string",
                "description": "Filter orders after this date (ISO 8601)"
            },
            "date_to": {
                "type": "string",
                "description": "Filter orders before this date (ISO 8601)"
            },
            "min_amount": {
                "type": "number",
                "description": "Minimum order amount"
            },
            "max_amount": {
                "type": "number",
                "description": "Maximum order amount"
            },
            "limit": {
                "type": "integer",
                "description": "Maximum results to return (default 20)",
                "default": 20
            }
        }
    }
}

When to Consolidate vs. Keep Separate

  • Consolidate when: Multiple tools differ only in their filter parameters (like the search example above), or when the tools share the same underlying operation.
  • Keep separate when: Tools have different side effects (read vs. write), different security requirements, or fundamentally different purposes.
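On the implementation side, the consolidated tool maps to one handler that applies whichever filters were supplied. A sketch over an in-memory list (hypothetical order records; the date filters are omitted for brevity):

```python
def search_orders(orders, *, customer_id=None, status=None,
                  min_amount=None, max_amount=None, limit=20):
    """Apply only the filters that were provided, then cap the result count."""
    results = []
    for order in orders:
        if customer_id is not None and order["customer_id"] != customer_id:
            continue
        if status is not None and order["status"] != status:
            continue
        if min_amount is not None and order["amount"] < min_amount:
            continue
        if max_amount is not None and order["amount"] > max_amount:
            continue
        results.append(order)
    return results[:limit]

orders = [
    {"id": "1", "customer_id": "c1", "status": "shipped", "amount": 40.0},
    {"id": "2", "customer_id": "c2", "status": "pending", "amount": 120.0},
]
```

Because every filter is optional, the model can combine criteria freely without the tool count growing with each new filter dimension.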

Strategy 5: MCP Server Composition

When using MCP, you can distribute tools across multiple servers and only connect to the servers relevant to the current session. This is a natural way to manage tool boundaries at the architectural level.

# Claude Desktop configuration with role-based server access
# Configuration for a customer support agent
{
    "mcpServers": {
        "customer-db": {
            "command": "python",
            "args": ["servers/customer_server.py"],
            "env": {"DB_URL": "postgresql://localhost/customers"}
        },
        "order-system": {
            "command": "python",
            "args": ["servers/order_server.py"],
            "env": {"DB_URL": "postgresql://localhost/orders"}
        },
        "knowledge-base": {
            "command": "python",
            "args": ["servers/kb_server.py"],
            "env": {"KB_PATH": "/data/support-kb"}
        }
    }
}

# Configuration for a data analyst agent (different tool set)
{
    "mcpServers": {
        "analytics-db": {
            "command": "python",
            "args": ["servers/analytics_server.py"],
            "env": {"DB_URL": "postgresql://localhost/analytics"}
        },
        "visualization": {
            "command": "python",
            "args": ["servers/viz_server.py"]
        },
        "data-warehouse": {
            "command": "python",
            "args": ["servers/warehouse_server.py"],
            "env": {"WAREHOUSE_URL": "bigquery://project/dataset"}
        }
    }
}
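Which configuration gets loaded can then be decided at session startup based on the agent's role. A minimal sketch (the paths and role names are hypothetical):

```python
import json
from pathlib import Path

# Hypothetical mapping from agent role to its MCP config file
ROLE_CONFIGS = {
    "support": "configs/support_agent.json",
    "analyst": "configs/analyst_agent.json",
}

def load_mcp_config(role: str) -> dict:
    """Load the MCP server configuration for the given agent role."""
    path = ROLE_CONFIGS.get(role)
    if path is None:
        raise ValueError(f"Unknown agent role: {role}")
    return json.loads(Path(path).read_text())
```

Failing loudly on an unknown role is deliberate: silently falling back to a broad config would defeat the boundary.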

Measuring Tool Selection Quality

To evaluate whether your tool management strategy is working, track these metrics:

  • Tool selection accuracy: What percentage of tool calls select the correct tool on the first attempt?
  • Tool call failure rate: What percentage of tool calls result in errors (wrong parameters, wrong tool, etc.)?
  • Average tools per request: How many tools are included in the average API call? Lower is generally better.
  • Unnecessary tool calls: How often does the model call a tool when it did not need to?
  • Token overhead: What percentage of context window tokens are consumed by tool definitions?

# Tracking tool selection metrics
class ToolMetrics:
    def __init__(self):
        self.total_calls = 0
        self.correct_selections = 0
        self.error_calls = 0
        self.tool_tokens_used = 0

    def record_call(self, tool_name: str, was_correct: bool, had_error: bool):
        self.total_calls += 1
        if was_correct:
            self.correct_selections += 1
        if had_error:
            self.error_calls += 1

    @property
    def accuracy(self) -> float:
        if self.total_calls == 0:
            return 0.0
        return self.correct_selections / self.total_calls

    @property
    def error_rate(self) -> float:
        if self.total_calls == 0:
            return 0.0
        return self.error_calls / self.total_calls

    def report(self) -> str:
        return (
            f"Tool Metrics Report:\n"
            f"  Total calls: {self.total_calls}\n"
            f"  Selection accuracy: {self.accuracy:.1%}\n"
            f"  Error rate: {self.error_rate:.1%}"
        )

Practical Guidelines

Based on Anthropic's recommendations and production experience, follow these guidelines for tool boundary management:

  • Keep active tool count under 20: For any single API call, aim for fewer than 20 tools. Performance is best with 5-10 focused tools.
  • Use descriptive, distinct tool names: If two tools could be confused by their names alone, reconsider the naming or consolidate them.
  • Include negative guidance in descriptions: Tell Claude when NOT to use a tool (e.g., “Do not use this for general knowledge questions”).
  • Test tool disambiguation: Create test cases where multiple tools could plausibly apply and verify correct selection.
  • Monitor and iterate: Track tool selection metrics in production and refine tool definitions based on real error patterns.
  • Use the system prompt for tool guidance: The system prompt can include instructions about when to prefer certain tools over others.

# Using the system prompt to guide tool selection
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system="""You are a customer support agent with access to the following tool categories:

- Customer tools (user_lookup, user_update): Use for questions about customer accounts
- Order tools (order_search, order_details, order_refund): Use for order-related inquiries
- Knowledge base (kb_search): Use for policy questions and general support procedures

Always try the knowledge base first for policy questions before looking up specific records.
For refund requests, always look up the order details first before processing a refund.""",
    tools=selected_tools,
    messages=[{"role": "user", "content": user_message}]
)

Summary of Tool Management Strategies

Here is a comparison of when to use each strategy:

  • Categorization and namespacing: Always use this as a foundation. Organize tools logically regardless of other strategies.
  • Keyword-based dynamic selection: Use when tool categories map cleanly to specific keywords in user messages. Fast and simple.
  • Embedding-based selection: Use when user queries are diverse and don't map neatly to keywords. More robust but adds latency.
  • Two-stage routing: Use when you have many categories and need AI-level reasoning to select the right ones. Best for 50+ tools.
  • Consolidation: Use when you have many tools that differ only in filter parameters. Reduces count without losing capability.
  • MCP server composition: Use when different roles or use cases need different tool sets. Natural architectural boundary.

Exam Tip: The exam tests multiple tool management strategies. The key strategies to remember are: (1) dynamic tool selection based on context, (2) two-stage routing with a classifier, (3) tool consolidation to reduce count, (4) MCP server composition for architectural separation, and (5) system prompt guidance for tool preference. The exam favors the principle that fewer, well-chosen tools outperform a large undifferentiated tool set.

Key Takeaway: Tool boundary management is about ensuring Claude always has access to the right tools -- not all tools, not too few tools, but exactly the tools needed for the current context. The core strategies are categorization (organizing tools into logical groups), dynamic selection (choosing tools based on the user's query), consolidation (reducing redundant tools), and architectural separation (using MCP server composition). Monitor tool selection accuracy and error rates to continuously refine your approach. Anthropic's guidance is clear: fewer, well-described tools consistently outperform a sprawling, undifferentiated tool catalog.