🏗️ Agentic Architecture · Lesson 1.3

Routing & Parallelization Patterns

Classify inputs to specialized handlers and run tasks concurrently.

25 min

Learning Objectives

  • Build routing logic for input classification
  • Implement sectioning and voting parallelization
  • Choose between routing vs. parallelization

Routing and Parallelization Patterns

Domain 1 covers two additional workflow patterns that go beyond simple sequential chaining: routing and parallelization. Both patterns address different challenges — routing handles input diversity, while parallelization handles throughput and confidence. Knowing when to apply each, and how they differ from each other and from prompt chaining, is a key competency tested in the exam.

Routing: Classifying Inputs to Specialized Handlers

Routing is the pattern of examining an incoming request and directing it to the most appropriate handler — a specialized prompt, a different model, a different workflow, or a human escalation path. The core idea is that a single generalist handler performs worse than multiple specialized handlers, each optimized for a specific input type.

Anthropic's documentation describes routing as allowing "different types of inputs [to be] handled by specialized workflows or prompts optimized for those inputs." Rather than writing one enormous prompt that tries to handle every case, you write a lean router and multiple focused handlers.

When Routing Is the Right Pattern

  • Your inputs fall into meaningfully different categories that benefit from different instructions, tone, or tools.
  • A single prompt handling all categories would become unwieldy or would compromise quality for at least one category.
  • Some input types should be handled differently for cost or latency reasons — for example, routing simple FAQs to a smaller, faster model while complex technical questions go to a larger one.
  • Certain categories require human escalation rather than automated handling.
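The cost/latency bullet above can be sketched as model-tier routing. This is a minimal illustration, not the lesson's reference implementation: the `is_simple_faq`-style keyword heuristic and the FAQ marker phrases are hypothetical, and the smaller-model name is an assumption (only `claude-opus-4-5` appears elsewhere in this lesson).

```python
def pick_model(message: str) -> str:
    """Route simple FAQs to a smaller, faster model tier; send everything
    else to a larger one. The keyword heuristic is deliberately simple."""
    FAQ_MARKERS = ("reset my password", "business hours", "where is my order")
    if any(marker in message.lower() for marker in FAQ_MARKERS):
        return "claude-haiku-4-5"   # assumed small/fast tier
    return "claude-opus-4-5"        # larger model used elsewhere in this lesson
```

In production you would typically replace the keyword heuristic with the rule-based or LLM-based classifiers discussed below, but the cost lever is the same: the routing decision selects which model pays for the work.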

Implementation Patterns for Routing

There are two primary approaches to implementing the routing decision itself:

  • LLM-based classification: Use a model call to classify the input and return a structured category. This handles nuanced, ambiguous, or natural-language inputs well but adds latency and token cost. Best for cases where the classification itself requires understanding context or intent.
  • Keyword or rule-based classification: Use deterministic logic — keyword matching, regex, or structured data fields — to route without an LLM call. This is faster, cheaper, and fully predictable. Best for cases where routing criteria are explicit and unambiguous (e.g., an "account_type" field in the request payload).

A hybrid approach is also common: use rule-based routing for high-confidence cases (e.g., a request that includes a structured issue_type field) and fall back to LLM classification for ambiguous or free-text inputs.
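The hybrid approach can be sketched as follows. Note that the `issue_type` payload field and the `llm_classify` fallback are hypothetical stand-ins for illustration; they are not defined by this lesson's reference code.

```python
# Hybrid routing sketch: deterministic rules first, LLM fallback second.
KNOWN_TYPES = {"billing", "technical", "sales"}

def hybrid_route(payload: dict, llm_classify) -> str:
    """Rule-based routing for high-confidence cases; LLM classification
    only when the input is ambiguous free text."""
    # High-confidence path: a structured field supplied by the client.
    issue_type = payload.get("issue_type")
    if issue_type in KNOWN_TYPES:
        return issue_type
    # Cheap deterministic keyword rules for obvious cases.
    text = payload.get("message", "").lower()
    if "refund" in text or "invoice" in text:
        return "billing"
    # Ambiguous free text: pay for an LLM classification call.
    return llm_classify(text)
```

The ordering matters: every request that the rules can resolve avoids a model call entirely, so the LLM's latency and token cost apply only to the genuinely ambiguous tail.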

Code Example: LLM-Based Routing

The following example routes customer support tickets to specialized handlers based on LLM-classified intent.

import anthropic

client = anthropic.Anthropic()

def llm(prompt: str, system: str = "") -> str:
    kwargs = {"model": "claude-opus-4-5", "max_tokens": 512,
              "messages": [{"role": "user", "content": prompt}]}
    if system:
        kwargs["system"] = system
    return client.messages.create(**kwargs).content[0].text.strip()

# ---------- Router ----------
def classify_route(user_input: str) -> str:
    """Classify the input and return a routing key."""
    prompt = f"""Classify the following customer message into exactly one category.
Respond with ONLY the category name.

Categories:
- billing      (payment issues, invoices, refunds, subscription charges)
- technical    (bugs, errors, integration problems, API issues)
- sales        (pricing questions, plan upgrades, feature inquiries)
- escalate     (frustrated customers, legal threats, executive requests)
- general      (everything else)

Message: {user_input}"""
    return llm(prompt).lower().strip()

# ---------- Specialized handlers ----------
def handle_billing(message: str) -> str:
    return llm(message, system="""You are a billing specialist at a SaaS company.
You have access to account and payment information. Be precise about timelines for
refunds (3-5 business days) and escalation paths. Offer a specific next step.""")

def handle_technical(message: str) -> str:
    return llm(message, system="""You are a senior technical support engineer.
Diagnose the problem systematically. Ask for error messages, logs, and reproduction
steps if they are missing. Reference documentation where applicable.""")

def handle_sales(message: str) -> str:
    return llm(message, system="""You are a sales engineer. Help the customer
understand which plan meets their needs. Focus on value and ROI. Do not discount
without manager approval.""")

def handle_escalation(message: str) -> str:
    return llm(message, system="""You are a senior customer success manager.
Acknowledge the customer's frustration immediately. Apologize where appropriate.
Commit to specific follow-up within 24 hours and offer direct contact information.""")

def handle_general(message: str) -> str:
    return llm(message, system="You are a helpful customer support agent.")

# ---------- Route dispatcher ----------
HANDLERS = {
    "billing": handle_billing,
    "technical": handle_technical,
    "sales": handle_sales,
    "escalate": handle_escalation,
    "general": handle_general,
}

def route_and_handle(user_input: str) -> dict:
    route = classify_route(user_input)
    handler = HANDLERS.get(route, handle_general)
    print(f"[Router] Classified as: {route}")
    response = handler(user_input)
    return {"route": route, "response": response}

# Example
result = route_and_handle(
    "I've been trying to integrate your webhook API for two days and keep getting "
    "a 401 error even though my API key is correct. This is blocking our launch."
)
print(f"Route: {result['route']}")
print(f"Response: {result['response']}")

Parallelization: Doing More at Once

Parallelization is the pattern of running multiple LLM calls simultaneously rather than sequentially. Anthropic identifies two distinct sub-patterns within parallelization: sectioning and voting.

Sectioning (Splitting Work)

Sectioning divides a large task into independent chunks that can be processed simultaneously, with the results aggregated afterward. This is appropriate when:

  • The task is too large to fit comfortably in a single context window.
  • The subtasks are genuinely independent — the output of one does not affect another.
  • Latency is a concern, and parallel execution would meaningfully reduce wall-clock time.

Examples: processing multiple documents simultaneously, generating several sections of a long report in parallel, or analyzing different time periods of a dataset concurrently.

Voting (Multiple Attempts)

Voting runs the same task multiple times, with identical or deliberately varied prompts, and uses majority agreement or a judge model to select the best or most reliable answer. This is appropriate when:

  • The task requires high confidence and the cost of an error is significant.
  • LLM outputs are stochastic and single-pass accuracy is insufficient.
  • You want to detect when the model is uncertain — low agreement across attempts signals low confidence.

Examples: safety checks run multiple times to reduce false-negative rates, consensus-based medical or legal analysis, or code generation where the most common correct answer is selected.
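Unlike sectioning, voting has no worked example later in this lesson, so here is a minimal sketch. The `ask` parameter is an assumption: any callable that returns a short answer string, for example a thin wrapper around `client.messages.create` (not shown here).

```python
import concurrent.futures
from collections import Counter

def vote(ask, question: str, n: int = 5) -> tuple[str, float]:
    """Ask the same question n times concurrently and return the majority
    answer plus an agreement ratio, which doubles as a confidence signal.
    `ask` is any callable mapping a question string to an answer string."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        # Normalize answers so trivial formatting differences do not split votes.
        answers = list(pool.map(lambda _: ask(question).strip().lower(), range(n)))
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```

Low agreement (say, 3/5 rather than 5/5) is itself useful signal: it can trigger escalation to a human reviewer or a retry with a stronger model.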

Code Example: Parallelization with Sectioning

The following example processes multiple customer reviews simultaneously and then aggregates the sentiment results.

import anthropic
import concurrent.futures
import json

client = anthropic.Anthropic()

def analyze_review(review: str) -> dict:
    """Analyze a single review — this will run in parallel for each review."""
    prompt = f"""Analyze the sentiment of this product review. Return JSON with:
- sentiment: "positive", "neutral", or "negative"
- score: integer from 1 (very negative) to 5 (very positive)
- key_theme: the main topic the customer is commenting on (one phrase)

Review: {review}

Respond with valid JSON only."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    raw = response.content[0].text.strip()
    # The model may wrap its JSON in a markdown code fence or add
    # surrounding prose; extract the JSON object itself.
    import re
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object in model response: {raw!r}")
    return json.loads(match.group(0))

def aggregate_results(results: list[dict]) -> dict:
    """Summarize the parallel analysis results."""
    sentiment_counts = {"positive": 0, "neutral": 0, "negative": 0}
    total_score = 0
    themes = []

    for r in results:
        sentiment_counts[r["sentiment"]] += 1
        total_score += r["score"]
        themes.append(r["key_theme"])

    avg_score = total_score / len(results) if results else 0
    return {
        "total_reviews": len(results),
        "sentiment_breakdown": sentiment_counts,
        "average_score": round(avg_score, 2),
        "top_themes": themes,
    }

def analyze_reviews_in_parallel(reviews: list[str]) -> dict:
    """Run sentiment analysis on all reviews in parallel using thread pool."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_review = {executor.submit(analyze_review, r): r for r in reviews}
        for future in concurrent.futures.as_completed(future_to_review):
            try:
                results.append(future.result())
            except Exception as e:
                print(f"Review analysis failed: {e}")

    return aggregate_results(results)

# Example usage
reviews = [
    "Absolutely love this product! The build quality is exceptional and it arrived early.",
    "Battery life is disappointing. Expected 10 hours, getting maybe 5. Very let down.",
    "Works as described. Nothing special but does the job for the price.",
    "Customer service was outstanding when I had a setup issue. Five stars for support.",
    "Broke after two weeks. Returning immediately. Poor quality control.",
]

summary = analyze_reviews_in_parallel(reviews)
print(json.dumps(summary, indent=2))

All five reviews are submitted to the API concurrently via a thread pool executor. The wall-clock time is approximately equal to the slowest single call rather than the sum of all calls — a significant throughput improvement for larger batches.

When to Use Routing vs. Parallelization

These two patterns solve different problems and are often confused with each other on the exam:

  • Use routing when your inputs are heterogeneous and different categories benefit from different handling. The distinguishing characteristic is a classification step that decides which path to take. One input → one handler.
  • Use parallelization (sectioning) when you have multiple independent items or chunks that can all be processed the same way simultaneously. The distinguishing characteristic is concurrency over a homogeneous workload. Many inputs → same handler, run concurrently.
  • Use parallelization (voting) when you need higher confidence on a single task than a single LLM call can reliably provide. The distinguishing characteristic is redundancy — the same question asked multiple times.
  • Combine them when needed: a router might classify an input and then dispatch it to a parallelized pipeline, or a parallelized section processor might route each chunk to different specialized handlers based on content type.
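The "combine them" bullet can be sketched as a router feeding a thread pool. The `classify` and `handlers` arguments are stand-ins for the router pieces shown earlier in this lesson; this is an illustrative composition, not prescribed architecture.

```python
import concurrent.futures

def route_then_parallelize(items: list[str], classify, handlers) -> list[str]:
    """Routing and parallelization combined: each item is classified to
    pick its specialized handler, and all handlers run concurrently."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(handlers[classify(item)], item) for item in items]
        # Results are collected in input order, not completion order.
        return [f.result() for f in futures]
```

Here routing decides *which* handler each item gets, and parallelization decides *when* the handlers run; the two concerns stay orthogonal.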

Exam Tip: The exam frequently tests your ability to distinguish routing from parallelization. The key differentiator is why multiple handlers exist. If they exist because inputs differ from one another (and need specialized treatment), that is routing. If they exist because you want to process many items at the same time (for speed) or run the same task redundantly (for confidence), that is parallelization. A routing question will always involve a classification step; a parallelization question will always involve concurrent execution or aggregation across multiple results.

Key Takeaways

Routing classifies inputs and directs them to specialized handlers. It improves quality by allowing each handler to be purpose-built for a specific input type. The router itself can be LLM-based (for nuanced classification) or rule-based (for speed and predictability).

Parallelization — sectioning splits a large or batched task into independent chunks and processes them concurrently. It improves throughput and allows tasks that exceed a single context window to be handled effectively.

Parallelization — voting runs the same task multiple times and aggregates results for higher confidence. It is appropriate when accuracy is critical and single-pass LLM reliability is insufficient for the use case.

Routing and parallelization are complementary, not competing. Real-world systems often combine them: route heterogeneous inputs to appropriate handlers, and use parallelization within each handler to improve throughput or confidence.