Best Practices Guide
Production-grade design patterns, anti-patterns, and architectural decisions for enterprise LLM systems. Based on The Architect's Playbook: Enterprise LLM Architecture.
27 Best Practices · 5 Categories · 5 Exam Domains
Route for Cost and SLA
The Problem
Defaulting to real-time Messages API for every request, even batch or async workloads, wastes cost and capacity.
The Solution
Route requests based on urgency: real-time Messages API for urgent exceptions (<30min SLA), Batch API for standard workflows (50% cost savings), and continuous batch submission for steady-state arrival.
Key Rule
Never default to real-time for asynchronous needs.
The Architect's Hierarchy of Constraints
The Problem
Optimizing for the wrong constraint; for example, reducing cost when accuracy is the real bottleneck.
The Solution
Evaluate every design decision against four constraints: Latency (mitigated by parallelization & caching), Accuracy (mitigated by structured intermediates & few-shot prompts), Cost (mitigated by batch APIs & context pruning), and Compliance (enforced by application-layer intercepts, not prompts).
Key Rule
Compliance must be enforced in code, never relied upon through prompts alone.
Four Domains of AI Architecture
The Problem
Applying the same architecture pattern to fundamentally different problem types.
The Solution
Recognize four distinct architectural domains, each with unique constraints: Structured Data Extraction (high volume, strict schemas), Customer Support Orchestration (stateful, human-in-the-loop), Developer Productivity (dynamic tasks, iterative context), and Multi-Agent Systems (parallel processing, shared memory).
Key Rule
Match your architecture to the problem domain; one size does not fit all.
Design Resilient Schemas
The Problem
Schemas that break when encountering unexpected values: fragile enums that fail on edge cases like 'studio' or 'converted warehouse'.
The Solution
Add an 'other' catch-all value to every enum, paired with a detail string field. This captures unexpected values without breaking the pipeline.
Key Rule
Every enum needs an 'other' option with a detail field; never assume you've seen all possible values.
Enforce Mathematical Consistency
The Problem
18% of invoice extractions show line items that don't match the grand total due to OCR errors or extraction mistakes.
The Solution
Use Schema Redundancy: extract both individual line item amounts AND the stated total from the document, then have the model compute a calculated_total by summing line items. Flag records for human review when calculated_total does not equal stated_total.
Key Rule
Always extract redundant fields and cross-validate mathematically.
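A minimal sketch of the cross-validation check, assuming amounts arrive as `Decimal` values and a small rounding tolerance is acceptable:

```python
from decimal import Decimal

def needs_review(line_items: list[Decimal], stated_total: Decimal,
                 tolerance: Decimal = Decimal("0.01")) -> bool:
    """Flag the record for human review when the sum of extracted line
    items disagrees with the total stated in the document."""
    calculated_total = sum(line_items, Decimal("0"))
    return abs(calculated_total - stated_total) > tolerance
```

Records that pass can flow to automated processing; the rest go to a review queue.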
Normalization and Null Handling
The Problem
Models invent plausible hallucinated data (e.g., attendee_count: 500) when fields are nullable and the document doesn't mention them.
The Solution
Add explicit null-handling instructions: 'If not mentioned in the text, return null.' For format inconsistencies (e.g., 'cotton blend' vs 'Cotton/Polyester mix'), use few-shot standardization examples.
Key Rule
Explicitly instruct null returns for missing fields β never let the model guess.
Know the Limits of Automated Retry
The Problem
Retrying endlessly on errors that automated retry cannot fix, like trying to extract author lists from documents that say 'et al.'
The Solution
Retries are effective for formatting errors (nested objects, locale strings), typically resolved in 2-3 attempts. Retries are ineffective for information that doesn't exist in the source. Recognize when to fail fast.
Key Rule
Retry formatting errors. Fail fast on missing information.
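A sketch of the triage logic; the error-category names are hypothetical labels your validation layer would assign:

```python
def should_retry(error_kind: str, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only errors a re-prompt can actually fix."""
    RETRYABLE = {"malformed_json", "wrong_nesting", "locale_number_format"}
    FAIL_FAST = {"missing_information", "field_not_in_source"}
    if error_kind in FAIL_FAST:
        return False  # no number of retries invents data absent from the source
    return error_kind in RETRYABLE and attempt < max_attempts
```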
Zero-Tolerance Compliance
The Problem
Relying on emphatic system prompts ('CRITICAL POLICY: NEVER process >$500') still yields a 3% failure rate.
The Solution
Implement application-layer intercepts that block policy violations in code. When a process amount exceeds the threshold, block it server-side and invoke human escalation. Remove model discretion entirely.
Key Rule
Compliance rules must be enforced in application code, never in prompts alone.
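A minimal sketch of such an intercept; the $500 threshold matches the example above, while the function names and return values are illustrative:

```python
PAYMENT_LIMIT = 500.00  # policy threshold enforced in code, not prompts

def execute_payment(amount: float, escalate) -> str:
    """Server-side intercept: the policy check runs before any
    model-proposed action executes. The model has no discretion here."""
    if amount > PAYMENT_LIMIT:
        escalate(amount)  # invoke human escalation
        return "blocked_pending_human_review"
    return "processed"
```

Because the check lives in application code, the 3% prompt-compliance failure rate becomes a 0% enforcement failure rate.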
Graceful Tool Failure
The Problem
Application exceptions that crash the agent, or tools returning empty strings on failure.
The Solution
Return error messages in the tool result content with an isError flag. Include error category (transient vs permanent) and whether it's retryable. Let the agent decide how to respond to the user.
Key Rule
Never throw exceptions or return empty strings from tools; return structured error objects.
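One way to sketch the wrapper; mapping `TimeoutError` to "transient" is an assumed classification, and the result keys mirror the fields named above:

```python
def safe_tool_call(tool_fn, **kwargs) -> dict:
    """Wrap a tool so failures reach the agent as structured results,
    never as exceptions or empty strings."""
    try:
        return {"content": tool_fn(**kwargs), "is_error": False}
    except TimeoutError as exc:
        return {"content": str(exc), "is_error": True,
                "category": "transient", "retryable": True}
    except Exception as exc:
        return {"content": str(exc), "is_error": True,
                "category": "permanent", "retryable": False}
```

The agent sees the error in-band and decides how to respond to the user.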
Calibrate Human-in-the-Loop
The Problem
No systematic approach to deciding what the model handles autonomously vs. what requires human review.
The Solution
Have the model output field-level confidence scores. Route extractions with confidence >90% to automated processing, and those below to a human review queue. Validate by analyzing accuracy across document types and fields.
Key Rule
Set confidence thresholds per field and document type, and validate before deploying.
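A sketch of record-level routing, assuming the model returns a (value, confidence) pair per field; in practice the threshold would be tuned per field and document type as described above:

```python
def route_extraction(fields: dict[str, tuple[object, float]],
                     threshold: float = 0.90) -> str:
    """Route the whole record: any field below threshold sends it
    to the human review queue."""
    low_confidence = [name for name, (_, conf) in fields.items()
                      if conf < threshold]
    return "human_review" if low_confidence else "automated"
```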
Tool Context Pruning
The Problem
Repeatedly calling lookup_order fills the context window with verbose shipping and payment data when only the return status is needed.
The Solution
Implement application-side filtering: extract only the relevant fields (items, purchase data, return window, status) from tool responses before injecting them into the conversation. Remove verbose details the agent doesn't need.
Key Rule
Filter tool responses application-side before they enter the context window.
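A minimal filtering sketch; the field names mirror the return-status example above but are otherwise assumptions about what `lookup_order` returns:

```python
RELEVANT_FIELDS = {"items", "purchase_date", "return_window", "status"}

def prune_tool_result(raw: dict) -> dict:
    """Keep only the fields the agent needs before the tool response
    is injected into the conversation context."""
    return {k: v for k, v in raw.items() if k in RELEVANT_FIELDS}
```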
Compress Long Sessions
The Problem
A 48-turn support session covering refund, subscription, and payment update approaches context limits.
The Solution
Summarize earlier, resolved turns into a narrative description. Preserve full verbatim message history only for the active, unresolved issue. This keeps the context window focused on what matters.
Key Rule
Summarize resolved issues, keep active issue verbatim.
Resume Asynchronous Sessions Safely
The Problem
Resuming a session hours later causes the agent to confidently state outdated status (e.g., 'Expected resolution: 24h' from a previous tool call).
The Solution
Resume with full conversation history, but programmatically filter out previous tool_result messages. Keep only human/assistant turns. This forces the agent to re-fetch current data via fresh tool calls.
Key Rule
Filter out stale tool_results on session resumption to force fresh data fetches.
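A sketch of the resume-time filter over a Messages-style history. Beyond what the text specifies, this version also drops tool_use blocks so no orphaned tool references remain in the transcript (an assumption about keeping the history well-formed):

```python
def strip_stale_tool_results(history: list[dict]) -> list[dict]:
    """Keep human/assistant text turns; drop tool blocks so the agent
    must re-fetch current data with fresh tool calls."""
    STALE_BLOCK_TYPES = {"tool_use", "tool_result"}
    resumed = []
    for msg in history:
        content = msg["content"]
        if isinstance(content, list):
            content = [b for b in content
                       if b.get("type") not in STALE_BLOCK_TYPES]
            if not content:
                continue  # message carried only stale tool output
            msg = {**msg, "content": content}
        resumed.append(msg)
    return resumed
```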
The Scratchpad Pattern
The Problem
In extended exploration sessions (30+ mins), accumulated token bloat causes inconsistent answers about early discoveries.
The Solution
Have the agent actively maintain a scratchpad file recording key findings, architectural maps, and decisions. It references this dense, structured file for subsequent questions rather than relying on raw message history.
Key Rule
Use a structured scratchpad file as continuous reference for long exploration sessions.
MCP Tool Specificity
The Problem
Providing a broad custom tool (analyze_dependencies) alongside built-in tools like Grep: the agent defaults to Grep and ignores the custom tool.
The Solution
Split broad tools into highly granular, single-purpose tools (list_imports, resolve_transitive_deps, detect_circular_deps). Enhance descriptions to explicitly detail capabilities and expected outputs.
Key Rule
Split broad tools into granular, single-purpose tools with explicit descriptions.
Shared Memory Architecture
The Problem
Daisy-chaining full conversation logs between subagents scales token costs exponentially.
The Solution
Decouple state from invocation. Have subagents index their outputs into a shared vector store. Subsequent agents use semantic search to retrieve only relevant prior findings. This prevents state loss if a multi-agent pipeline crashes mid-processing.
Key Rule
Use a shared vector store instead of passing full conversation logs between agents.
Goal-Oriented Delegation
The Problem
Giving subagents detailed step-by-step procedural instructions causes rigid execution that fails on edge cases or misses tangential sources.
The Solution
Specify research goals and quality criteria rather than procedural steps. Let the specialized subagent determine its own search strategy. Use enum parameters (e.g., analysis_type: extraction | summarization) to guide behavior without micromanaging.
Key Rule
Delegate goals and criteria, not step-by-step procedures.
Force Execution Order with tool_choice
The Problem
An agent needs to extract metadata before calling enrichment tools, but occasionally calls enrichment first, leading to failures.
The Solution
Use tool_choice: {type: 'tool', name: 'extract_metadata'} in the first API call to force the correct execution order. Don't rely on prompt instructions for sequencing; use the API's constraints.
Key Rule
Use tool_choice to enforce execution order; don't rely on prompt begging.
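The shape of the first request, using the Anthropic Messages API's tool_choice parameter; tool schemas are abbreviated and the model id is a placeholder:

```python
# First call forces extract_metadata; subsequent calls can omit
# tool_choice (defaulting to "auto") so the model picks freely.
first_request = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "tools": [
        {"name": "extract_metadata", "description": "Extract document metadata.",
         "input_schema": {"type": "object"}},
        {"name": "enrich_record", "description": "Enrich using extracted metadata.",
         "input_schema": {"type": "object"}},
    ],
    "tool_choice": {"type": "tool", "name": "extract_metadata"},
    "messages": [{"role": "user", "content": "Process this document."}],
}
```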
Structured Intermediate Representations
The Problem
Passing raw text from financial and news agents to a synthesis agent loses table clarity and narrative flow.
The Solution
Add a format conversion layer that standardizes all subagent outputs into a common JSON intermediate representation with claim, evidence, source, and confidence fields. Require all subagents to output structured claim-source mappings.
Key Rule
Standardize all agent outputs into a common structured format before synthesis.
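A sketch of the conversion layer; the keys of each subagent's raw findings (`text`, `quote`, `source`, `confidence`) are assumptions, while the output fields follow the common representation named above:

```python
def to_claims(agent_name: str, findings: list[dict]) -> list[dict]:
    """Normalize one subagent's raw output into the shared
    claim/evidence/source/confidence representation."""
    return [{
        "claim": f["text"],
        "evidence": f.get("quote", ""),
        "source": f.get("source", agent_name),
        "confidence": float(f.get("confidence", 0.5)),
    } for f in findings]
```

The synthesis agent then consumes one uniform list regardless of which agent produced each claim.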
Parallelization & Prompt Caching
The Problem
Sequential processing of 12 legal precedents takes over 3 minutes: unacceptable latency.
The Solution
Spawn parallel subagents, each handling a subset of precedents. For the synthesis step, enable prompt caching when accumulated findings exceed 80K tokens to reduce transfer overhead. Parallel execution reduces wall-clock time from 3+ minutes to ~30 seconds.
Key Rule
Parallelize independent work. Cache when accumulated context exceeds 80K tokens.
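The fan-out step can be sketched with a thread pool; `analyze_precedent` is a stand-in for one subagent call, since independent precedents have no ordering dependency:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_precedent(precedent: str) -> str:
    """Stand-in for a single subagent call analyzing one precedent."""
    return f"summary_of_{precedent}"

def analyze_all(precedents: list[str]) -> list[str]:
    # Wall-clock time is bounded by the slowest single call,
    # not the sum of all calls.
    with ThreadPoolExecutor(max_workers=12) as pool:
        return list(pool.map(analyze_precedent, precedents))
```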
Branching Reality with fork_session
The Problem
An agent exploring multiple solution paths sequentially wastes time and context; each dead end pollutes the context window for the next attempt.
The Solution
Use fork_session to clone the current conversation state into parallel branches. Each branch independently explores a different approach. Merge only the winning branch back into the main session, discarding failed explorations entirely.
Key Rule
Fork before exploring; never pollute the main context with experimental dead ends.
Resumption in Dynamic Environments
The Problem
A support agent resumes a session but the customer's order status, payment, or ticket has changed since the last interaction, so the agent uses stale cached data.
The Solution
On session resumption, inject a 'state_changed' system event listing what has changed since the last interaction. The agent acknowledges changes before continuing. Build a diff-aware resumption layer that compares current state to cached state.
Key Rule
Always inject a state diff on resumption; never assume the world hasn't changed.
Escalation Handoff Payload
The Problem
When an agent escalates to a human, the human agent has no context; they ask the customer to repeat everything, destroying the experience.
The Solution
Build a structured escalation payload that includes: conversation summary, customer sentiment, actions already taken, reason for escalation, and recommended next steps. The human agent receives a complete briefing before taking over.
Key Rule
Every escalation must include a structured handoff payload; never escalate naked.
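A sketch of the payload builder; the field names follow the briefing items listed above, and rejecting incomplete payloads is an added assumption about how to enforce the rule:

```python
def build_escalation_payload(summary: str, sentiment: str,
                             actions_taken: list[str], reason: str,
                             next_steps: list[str]) -> dict:
    """Assemble the briefing a human agent receives on handoff,
    refusing to escalate with any section missing."""
    payload = {
        "conversation_summary": summary,
        "customer_sentiment": sentiment,
        "actions_taken": actions_taken,
        "escalation_reason": reason,
        "recommended_next_steps": next_steps,
    }
    missing = [key for key, value in payload.items() if not value]
    if missing:
        raise ValueError(f"incomplete handoff payload: {missing}")
    return payload
```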
Directed Codebase Exploration
The Problem
A coding agent uses brute-force grep across an entire repository, consuming thousands of tokens on irrelevant files before finding what it needs.
The Solution
Start with high-level structural files (package.json, directory listings, README) to build a mental map. Then navigate precisely to relevant modules. Use the scratchpad pattern to record the codebase map for future reference.
Key Rule
Map first, then navigate; never brute-force search an unfamiliar codebase.
Streaming with Server-Sent Events
The Problem
Users stare at a blank screen for 10-30 seconds while the model generates a complete response; the perceived latency destroys UX.
The Solution
Use the Streaming Messages API with Server-Sent Events (SSE). Stream tokens as they're generated so users see progressive output. Handle event types: message_start, content_block_delta, message_stop. Implement client-side token buffering for smooth rendering.
Key Rule
Always stream responses in user-facing applications; never make users wait for complete generation.
Prompt Caching Strategy
The Problem
Repeated API calls with identical system prompts and long reference documents burn through tokens; each call re-processes the same static content.
The Solution
Use prompt caching to mark static content (system prompts, reference docs, few-shot examples) as cacheable. Cache hits cost 90% less than re-processing. Structure messages so static content comes first, dynamic content last.
Key Rule
Put static content first, dynamic content last to maximize cache hit rate.
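A sketch of the request layout, using cache_control blocks in the style of the Anthropic prompt-caching API; the exact block placement is an illustration of the static-first principle:

```python
def build_request(system_prompt: str, reference_doc: str,
                  user_query: str) -> dict:
    """Static content first and marked cacheable; the per-request query
    comes last so the long prefix can be served from cache."""
    return {
        "system": [
            {"type": "text", "text": system_prompt},
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},  # cache prefix ends here
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```

Every call with the same system prompt and reference document reuses the cached prefix; only the query is processed fresh.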
The Architect's Reference Matrix
The Problem
Architects struggle to choose between API endpoints, models, and patterns; no single reference maps scenarios to optimal configurations.
The Solution
Use a decision matrix mapping each scenario type to its optimal API endpoint, model, context strategy, and error handling approach. The matrix covers: data extraction (Batch API + strict schemas), support agents (Messages API + sliding window), code agents (Messages API + scratchpad), and multi-agent (parallel subagents + shared memory).
Key Rule
Consult the reference matrix before designing; don't reinvent the decision tree.
The Production Architecture Blueprint
Intelligence at the edges. Strict typing in the middle. Application intercepts guarding the core. Shared memory sustaining the lifecycle.