Best Practices Guide
Production-grade design patterns, anti-patterns, and architectural decisions for enterprise LLM systems. Based on The Architect's Playbook: Enterprise LLM Architecture.
27 Best Practices · 5 Categories · 5 Exam Domains
Route for Cost and SLA
The Problem
Defaulting to real-time Messages API for every request, even batch or async workloads, wastes cost and capacity.
The Solution
Route requests based on urgency: real-time Messages API for urgent exceptions (<30min SLA), Batch API for standard workflows (50% cost savings), and continuous batch submission for steady-state arrival.
Key Rule
Never default to real-time for asynchronous needs.
The Architect's Hierarchy of Constraints
The Problem
Optimizing for the wrong constraint; for example, reducing cost when accuracy is the real bottleneck.
The Solution
Evaluate every design decision against four constraints: Latency (mitigated by parallelization & caching), Accuracy (mitigated by structured intermediates & few-shot prompts), Cost (mitigated by batch APIs & context pruning), and Compliance (enforced by application-layer intercepts, not prompts).
Key Rule
Compliance must be enforced in code, never relied upon through prompts alone.
Four Domains of AI Architecture
The Problem
Applying the same architecture pattern to fundamentally different problem types.
The Solution
Recognize four distinct architectural domains, each with unique constraints: Structured Data Extraction (high volume, strict schemas), Customer Support Orchestration (stateful, human-in-the-loop), Developer Productivity (dynamic tasks, iterative context), and Multi-Agent Systems (parallel processing, shared memory).
Key Rule
Match your architecture to the problem domain; one size does not fit all.
Design Resilient Schemas
The Problem
Schemas that break when encountering unexpected values: fragile enums that fail on edge cases like 'studio' or 'converted warehouse'.
The Solution
Add an 'other' catch-all value to every enum, paired with a detail string field. This captures unexpected values without breaking the pipeline.
Key Rule
Every enum needs an 'other' option with a detail field; never assume you've seen all possible values.
Enforce Mathematical Consistency
The Problem
18% of invoice extractions show line items that don't match the grand total due to OCR errors or extraction mistakes.
The Solution
Use Schema Redundancy: extract both individual line item amounts AND the stated total from the document, then have the model compute a calculated_total by summing line items. Flag records for human review when calculated_total does not equal stated_total.
Key Rule
Always extract redundant fields and cross-validate mathematically.
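A minimal sketch of the cross-validation check, assuming amounts arrive as `Decimal` values and a small rounding tolerance is acceptable:

```python
from decimal import Decimal

def needs_review(line_items: list[Decimal], stated_total: Decimal,
                 tolerance: Decimal = Decimal("0.01")) -> bool:
    """Flag the record for human review when the sum of extracted line
    items disagrees with the total stated in the document."""
    calculated_total = sum(line_items, Decimal("0"))
    return abs(calculated_total - stated_total) > tolerance
```

Records that pass can flow to automated processing; the rest go to a review queue.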
Normalization and Null Handling
The Problem
Models invent plausible hallucinated data (e.g., attendee_count: 500) when fields are nullable and the document doesn't mention them.
The Solution
Add explicit null-handling instructions: 'If not mentioned in the text, return null.' For format inconsistencies (e.g., 'cotton blend' vs 'Cotton/Polyester mix'), use few-shot standardization examples.
Key Rule
Explicitly instruct null returns for missing fields β never let the model guess.
Know the Limits of Automated Retry
The Problem
Retrying endlessly on errors that automated retry cannot fix, like trying to extract author lists from documents that say 'et al.'
The Solution
Retries are effective for formatting errors (nested objects, locale strings), typically resolved in 2-3 attempts. Retries are ineffective for information that doesn't exist in the source. Recognize when to fail fast.
Key Rule
Retry formatting errors. Fail fast on missing information.
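A sketch of the triage logic; the error-category names are hypothetical labels your validation layer would assign:

```python
def should_retry(error_kind: str, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only errors a re-prompt can actually fix."""
    RETRYABLE = {"malformed_json", "wrong_nesting", "locale_number_format"}
    FAIL_FAST = {"missing_information", "field_not_in_source"}
    if error_kind in FAIL_FAST:
        return False  # no number of retries invents data absent from the source
    return error_kind in RETRYABLE and attempt < max_attempts
```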
Zero-Tolerance Compliance
The Problem
Relying on emphatic system prompts ('CRITICAL POLICY: NEVER process >$500') still yields a 3% failure rate.
The Solution
Implement application-layer intercepts that block policy violations in code. When a process amount exceeds the threshold, block it server-side and invoke human escalation. Remove model discretion entirely.
Key Rule
Compliance rules must be enforced in application code, never in prompts alone.
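A minimal sketch of such an intercept; the $500 threshold matches the example above, while the function names and return values are illustrative:

```python
PAYMENT_LIMIT = 500.00  # policy threshold enforced in code, not prompts

def execute_payment(amount: float, escalate) -> str:
    """Server-side intercept: the policy check runs before any
    model-proposed action executes. The model has no discretion here."""
    if amount > PAYMENT_LIMIT:
        escalate(amount)  # invoke human escalation
        return "blocked_pending_human_review"
    return "processed"
```

Because the check lives in application code, the 3% prompt-compliance failure rate becomes a 0% enforcement failure rate.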
Graceful Tool Failure
The Problem
Application exceptions that crash the agent, or tools returning empty strings on failure.
The Solution
Return error messages in the tool result content with an isError flag. Include error category (transient vs permanent) and whether it's retryable. Let the agent decide how to respond to the user.
Key Rule
Never throw exceptions or return empty strings from tools; return structured error objects.
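One way to sketch the wrapper; mapping `TimeoutError` to "transient" is an assumed classification, and the result keys mirror the fields named above:

```python
def safe_tool_call(tool_fn, **kwargs) -> dict:
    """Wrap a tool so failures reach the agent as structured results,
    never as exceptions or empty strings."""
    try:
        return {"content": tool_fn(**kwargs), "is_error": False}
    except TimeoutError as exc:
        return {"content": str(exc), "is_error": True,
                "category": "transient", "retryable": True}
    except Exception as exc:
        return {"content": str(exc), "is_error": True,
                "category": "permanent", "retryable": False}
```

The agent sees the error in-band and decides how to respond to the user.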
Calibrate Human-in-the-Loop
The Problem
No systematic approach to deciding what the model handles autonomously vs. what requires human review.
The Solution
Have the model output field-level confidence scores. Route extractions with confidence >90% to automated processing, and those below to a human review queue. Validate by analyzing accuracy across document types and fields.
Key Rule
Set confidence thresholds per field and document type, and validate before deploying.
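A sketch of record-level routing, assuming the model returns a (value, confidence) pair per field; in practice the threshold would be tuned per field and document type as described above:

```python
def route_extraction(fields: dict[str, tuple[object, float]],
                     threshold: float = 0.90) -> str:
    """Route the whole record: any field below threshold sends it
    to the human review queue."""
    low_confidence = [name for name, (_, conf) in fields.items()
                      if conf < threshold]
    return "human_review" if low_confidence else "automated"
```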
Tool Context Pruning
The Problem
Repeatedly calling lookup_order fills the context window with verbose shipping and payment data when only the return status is needed.
The Solution
Implement application-side filtering: extract only the relevant fields (items, purchase data, return window, status) from tool responses before injecting them into the conversation. Remove verbose details the agent doesn't need.
Key Rule
Filter tool responses application-side before they enter the context window.
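A minimal filtering sketch; the field names mirror the return-status example above but are otherwise assumptions about what `lookup_order` returns:

```python
RELEVANT_FIELDS = {"items", "purchase_date", "return_window", "status"}

def prune_tool_result(raw: dict) -> dict:
    """Keep only the fields the agent needs before the tool response
    is injected into the conversation context."""
    return {k: v for k, v in raw.items() if k in RELEVANT_FIELDS}
```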
Compress Long Sessions
The Problem
A 48-turn support session covering refund, subscription, and payment update approaches context limits.
The Solution
Summarize earlier, resolved turns into a narrative description. Preserve full verbatim message history only for the active, unresolved issue. This keeps the context window focused on what matters.
Key Rule
Summarize resolved issues, keep active issue verbatim.
Resume Asynchronous Sessions Safely
The Problem
Resuming a session hours later causes the agent to confidently state outdated status (e.g., 'Expected resolution: 24h' from a previous tool call).
The Solution
Resume with full conversation history, but programmatically filter out previous tool_result messages. Keep only human/assistant turns. This forces the agent to re-fetch current data via fresh tool calls.
Key Rule
Filter out stale tool_results on session resumption to force fresh data fetches.
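A sketch of the resume-time filter over a Messages-style history. Beyond what the text specifies, this version also drops tool_use blocks so no orphaned tool references remain in the transcript (an assumption about keeping the history well-formed):

```python
def strip_stale_tool_results(history: list[dict]) -> list[dict]:
    """Keep human/assistant text turns; drop tool blocks so the agent
    must re-fetch current data with fresh tool calls."""
    STALE_BLOCK_TYPES = {"tool_use", "tool_result"}
    resumed = []
    for msg in history:
        content = msg["content"]
        if isinstance(content, list):
            content = [b for b in content
                       if b.get("type") not in STALE_BLOCK_TYPES]
            if not content:
                continue  # message carried only stale tool output
            msg = {**msg, "content": content}
        resumed.append(msg)
    return resumed
```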
The Scratchpad Pattern
The Problem
In extended exploration sessions (30+ mins), accumulated token bloat causes inconsistent answers about early discoveries.
The Solution
Have the agent actively maintain a scratchpad file recording key findings, architectural maps, and decisions. It references this dense, structured file for subsequent questions rather than relying on raw message history.
Key Rule
Use a structured scratchpad file as continuous reference for long exploration sessions.
MCP Tool Specificity
The Problem
Providing a broad custom tool (analyze_dependencies) alongside built-in tools like Grep: the agent defaults to Grep and ignores the custom tool.
The Solution
Split broad tools into highly granular, single-purpose tools (list_imports, resolve_transitive_deps, detect_circular_deps). Enhance descriptions to explicitly detail capabilities and expected outputs.
Key Rule
Split broad tools into granular, single-purpose tools with explicit descriptions.
Shared Memory Architecture
The Problem
Daisy-chaining full conversation logs between subagents scales token costs exponentially.
The Solution
Decouple state from invocation. Have subagents index their outputs into a shared vector store. Subsequent agents use semantic search to retrieve only relevant prior findings. This prevents state loss if a multi-agent pipeline crashes mid-processing.
Key Rule
Use a shared vector store instead of passing full conversation logs between agents.
Goal-Oriented Delegation
The Problem
Giving subagents detailed step-by-step procedural instructions causes rigid execution that fails on edge cases or misses tangential sources.
The Solution
Specify research goals and quality criteria rather than procedural steps. Let the specialized subagent determine its own search strategy. Use enum parameters (e.g., analysis_type: extraction | summarization) to guide behavior without micromanaging.
Key Rule
Delegate goals and criteria, not step-by-step procedures.
Force Execution Order with tool_choice
The Problem
An agent needs to extract metadata before calling enrichment tools, but occasionally calls enrichment first, leading to failures.
The Solution
Use tool_choice: {type: 'tool', name: 'extract_metadata'} in the first API call to force the correct execution order. Don't rely on prompt instructions for sequencing; use the API's constraints.
Key Rule
Use tool_choice to enforce execution order; don't rely on prompt begging.
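The shape of the first request, using the Anthropic Messages API's tool_choice parameter; tool schemas are abbreviated and the model id is a placeholder:

```python
# First call forces extract_metadata; subsequent calls can omit
# tool_choice (defaulting to "auto") so the model picks freely.
first_request = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "tools": [
        {"name": "extract_metadata", "description": "Extract document metadata.",
         "input_schema": {"type": "object"}},
        {"name": "enrich_record", "description": "Enrich using extracted metadata.",
         "input_schema": {"type": "object"}},
    ],
    "tool_choice": {"type": "tool", "name": "extract_metadata"},
    "messages": [{"role": "user", "content": "Process this document."}],
}
```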
Structured Intermediate Representations
The Problem
Passing raw text from financial and news agents to a synthesis agent loses table clarity and narrative flow.
The Solution
Add a format conversion layer that standardizes all subagent outputs into a common JSON intermediate representation with claim, evidence, source, and confidence fields. Require all subagents to output structured claim-source mappings.
Key Rule
Standardize all agent outputs into a common structured format before synthesis.
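A sketch of the conversion layer; the keys of each subagent's raw findings (`text`, `quote`, `source`, `confidence`) are assumptions, while the output fields follow the common representation named above:

```python
def to_claims(agent_name: str, findings: list[dict]) -> list[dict]:
    """Normalize one subagent's raw output into the shared
    claim/evidence/source/confidence representation."""
    return [{
        "claim": f["text"],
        "evidence": f.get("quote", ""),
        "source": f.get("source", agent_name),
        "confidence": float(f.get("confidence", 0.5)),
    } for f in findings]
```

The synthesis agent then consumes one uniform list regardless of which agent produced each claim.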
Parallelization & Prompt Caching
The Problem
Sequential processing of 12 legal precedents takes over 3 minutes: unacceptable latency.
The Solution
Spawn parallel subagents, each handling a subset of precedents. For the synthesis step, enable prompt caching when accumulated findings exceed 80K tokens to reduce transfer overhead. Parallel execution reduces wall-clock time from 3+ minutes to ~30 seconds.
Key Rule
Parallelize independent work. Cache when accumulated context exceeds 80K tokens.
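The fan-out step can be sketched with a thread pool; `analyze_precedent` is a stand-in for one subagent call, since independent precedents have no ordering dependency:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_precedent(precedent: str) -> str:
    """Stand-in for a single subagent call analyzing one precedent."""
    return f"summary_of_{precedent}"

def analyze_all(precedents: list[str]) -> list[str]:
    # Wall-clock time is bounded by the slowest single call,
    # not the sum of all calls.
    with ThreadPoolExecutor(max_workers=12) as pool:
        return list(pool.map(analyze_precedent, precedents))
```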
Branching Reality with fork_session
The Problem
An agent exploring multiple solution paths sequentially wastes time and context; each dead end pollutes the context window for the next attempt.
The Solution
Use fork_session to clone the current conversation state into parallel branches. Each branch independently explores a different approach. Merge only the winning branch back into the main session, discarding failed explorations entirely.
Key Rule
Fork before exploring; never pollute the main context with experimental dead ends.
Resumption in Dynamic Environments
The Problem
A support agent resumes a session but the customer's order status, payment, or ticket has changed since the last interaction, so the agent uses stale cached data.
The Solution
On session resumption, inject a 'state_changed' system event listing what has changed since the last interaction. The agent acknowledges changes before continuing. Build a diff-aware resumption layer that compares current state to cached state.
Key Rule
Always inject a state diff on resumption; never assume the world hasn't changed.
Escalation Handoff Payload
The Problem
When an agent escalates to a human, the human agent has no context; they ask the customer to repeat everything, destroying the experience.
The Solution
Build a structured escalation payload that includes: conversation summary, customer sentiment, actions already taken, reason for escalation, and recommended next steps. The human agent receives a complete briefing before taking over.
Key Rule
Every escalation must include a structured handoff payload; never escalate naked.
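A sketch of the payload builder; the field names follow the briefing items listed above, and rejecting incomplete payloads is an added assumption about how to enforce the rule:

```python
def build_escalation_payload(summary: str, sentiment: str,
                             actions_taken: list[str], reason: str,
                             next_steps: list[str]) -> dict:
    """Assemble the briefing a human agent receives on handoff,
    refusing to escalate with any section missing."""
    payload = {
        "conversation_summary": summary,
        "customer_sentiment": sentiment,
        "actions_taken": actions_taken,
        "escalation_reason": reason,
        "recommended_next_steps": next_steps,
    }
    missing = [key for key, value in payload.items() if not value]
    if missing:
        raise ValueError(f"incomplete handoff payload: {missing}")
    return payload
```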
Directed Codebase Exploration
The Problem
A coding agent uses brute-force grep across an entire repository, consuming thousands of tokens on irrelevant files before finding what it needs.
The Solution
Start with high-level structural files (package.json, directory listings, README) to build a mental map. Then navigate precisely to relevant modules. Use the scratchpad pattern to record the codebase map for future reference.
Key Rule
Map first, then navigate; never brute-force search an unfamiliar codebase.
Streaming with Server-Sent Events
The Problem
Users stare at a blank screen for 10-30 seconds while the model generates a complete response; the perceived latency destroys UX.
The Solution
Use the Streaming Messages API with Server-Sent Events (SSE). Stream tokens as they're generated so users see progressive output. Handle event types: message_start, content_block_delta, message_stop. Implement client-side token buffering for smooth rendering.
Key Rule
Always stream responses in user-facing applications; never make users wait for complete generation.
Prompt Caching Strategy
The Problem
Repeated API calls with identical system prompts and long reference documents burn through tokens; each call re-processes the same static content.
The Solution
Use prompt caching to mark static content (system prompts, reference docs, few-shot examples) as cacheable. Cache hits cost 90% less than re-processing. Structure messages so static content comes first, dynamic content last.
Key Rule
Put static content first, dynamic content last to maximize cache hit rate.
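A sketch of the request layout, using cache_control blocks in the style of the Anthropic prompt-caching API; the exact block placement is an illustration of the static-first principle:

```python
def build_request(system_prompt: str, reference_doc: str,
                  user_query: str) -> dict:
    """Static content first and marked cacheable; the per-request query
    comes last so the long prefix can be served from cache."""
    return {
        "system": [
            {"type": "text", "text": system_prompt},
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},  # cache prefix ends here
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```

Every call with the same system prompt and reference document reuses the cached prefix; only the query is processed fresh.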
The Architect's Reference Matrix
The Problem
Architects struggle to choose between API endpoints, models, and patterns; no single reference maps scenarios to optimal configurations.
The Solution
Use a decision matrix mapping each scenario type to its optimal API endpoint, model, context strategy, and error handling approach. The matrix covers: data extraction (Batch API + strict schemas), support agents (Messages API + sliding window), code agents (Messages API + scratchpad), and multi-agent (parallel subagents + shared memory).
Key Rule
Consult the reference matrix before designing; don't reinvent the decision tree.
The Production Architecture Blueprint
Intelligence at the edges. Strict typing in the middle. Application intercepts guarding the core. Shared memory sustaining the lifecycle.