🏗️ Agentic Architecture · Lesson 1.5

Evaluator-Optimizer Pattern

Iterative loops with evaluation and refinement for quality output.

25 min

Learning Objectives

  • Build evaluation criteria for iterative refinement
  • Implement feedback loops between evaluator and generator
  • Set termination conditions for optimization loops

Evaluator-Optimizer Pattern

The evaluator-optimizer pattern introduces iterative refinement into agentic workflows. Rather than accepting the first output a model produces, this pattern creates a feedback loop: a generator produces output, an evaluator scores or critiques it against defined criteria, and if the output falls short, the generator tries again with the evaluation feedback. This loop continues until the output meets a quality threshold or a maximum iteration count is reached.

How the Pattern Works

The evaluator-optimizer pattern follows a simple cycle:

  1. Generate: The generator LLM produces an initial output based on the task instructions.
  2. Evaluate: The evaluator (a separate LLM call, or deterministic code, or both) assesses the output against a defined rubric or criteria.
  3. Decide: If the evaluation passes (meets quality threshold), return the output. If it fails, feed the evaluation feedback back to the generator.
  4. Refine: The generator produces a new version, informed by the evaluator's critique. Go to step 2.

The key insight is that the evaluator's feedback gives the generator specific, actionable direction for improvement — rather than just re-running the same prompt and hoping for a better result.
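The four-step cycle can be sketched as a small generic loop. This is a minimal illustration, not code from this lesson: `refine_loop`, `toy_generate`, and `toy_evaluate` are hypothetical names, with the generator and evaluator passed in as callables.

```python
from typing import Callable

def refine_loop(generate: Callable[[str], str],
                evaluate: Callable[[str], tuple[bool, str]],
                max_iterations: int = 3) -> str:
    """Generic generate -> evaluate -> decide -> refine cycle."""
    feedback = ""
    for _ in range(max_iterations):
        output = generate(feedback)          # steps 1/4: generate, or refine with feedback
        passed, feedback = evaluate(output)  # step 2: evaluate against criteria
        if passed:                           # step 3: decide
            return output
    return output                            # best effort once the cap is hit

# Toy stand-ins: the "generator" produces one more character per attempt,
# and the "evaluator" demands at least three characters.
attempts: list[str] = []

def toy_generate(feedback: str) -> str:
    attempts.append(feedback)
    return "x" * len(attempts)

def toy_evaluate(output: str) -> tuple[bool, str]:
    return (len(output) >= 3, "output too short")

result = refine_loop(toy_generate, toy_evaluate)
print(result)  # "xxx" after three refinement rounds
```

The real generator and evaluator in the code example later in this lesson slot into the same shape.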

Setting Evaluation Criteria

The quality of the evaluator-optimizer loop depends entirely on the quality of the evaluation criteria. Effective criteria are:

  • Specific and measurable: "The output must contain at least 3 code examples" is better than "The output should be detailed."
  • Objective where possible: Prefer deterministic checks (valid JSON? correct length? all required fields present?) over subjective LLM judgments whenever a criterion can be verified in code.
  • Prioritized: Not all criteria are equally important. A rubric should distinguish between hard requirements (must pass) and soft preferences (nice to have).
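A rubric that separates hard requirements from soft preferences can be expressed as plain deterministic checks. This is an illustrative sketch (the names `HARD`, `SOFT`, and `score_output` are hypothetical, not from this lesson): hard failures force a retry, while missed soft preferences only lower the score.

```python
import re

# Hard requirements: every one must pass for the output to be accepted.
HARD = {
    "min_examples": lambda s: s.lower().count("example") >= 3,
    "within_length": lambda s: len(s) <= 8000,
}

# Soft preferences: missed ones reduce the score but do not force a retry.
SOFT = {
    "mentions_edge_cases": lambda s: "edge case" in s.lower(),
    "has_headings": lambda s: bool(re.search(r"^#+ ", s, re.MULTILINE)),
}

def score_output(text: str) -> dict:
    """Apply the rubric: pass/fail on hard checks, fractional score on soft ones."""
    hard_failures = [name for name, check in HARD.items() if not check(text)]
    soft_hits = sum(1 for check in SOFT.values() if check(text))
    return {
        "pass": not hard_failures,          # all hard requirements satisfied
        "score": soft_hits / len(SOFT),     # fraction of soft preferences met
        "failed_hard": hard_failures,       # actionable feedback for the generator
    }
```

The `failed_hard` list doubles as specific, actionable feedback to hand back to the generator on the next iteration.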

Types of Evaluators

  • Deterministic evaluators: Code-based checks — JSON schema validation, regex matching, length constraints, required keyword presence. Fast, free, and reliable.
  • LLM evaluators: A second model call that reads the output and scores it against a rubric. More flexible but adds cost and latency. Useful for subjective criteria like "is this explanation clear?" or "does this code follow best practices?"
  • Hybrid evaluators: Run deterministic checks first (fast, cheap), then invoke an LLM evaluator only if the deterministic checks pass. This minimizes unnecessary LLM evaluation calls.
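A hybrid evaluator might look like the following sketch. The names are illustrative, and `llm_judge` stands in for any callable that wraps a model call and returns a verdict dict; it is only invoked once the cheap deterministic gate passes.

```python
import json

def hybrid_evaluate(output: str, llm_judge=None) -> dict:
    """Stage 1: deterministic checks (fast, free). Stage 2: LLM judge, only if
    stage 1 passes. `llm_judge` is a hypothetical callable returning a dict
    like {"pass": bool, "feedback": str}."""
    # Stage 1: structural checks that need no model call.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"pass": False, "stage": "deterministic",
                "feedback": "Output is not valid JSON."}
    missing = [k for k in ("title", "body") if k not in data]
    if missing:
        return {"pass": False, "stage": "deterministic",
                "feedback": f"Missing required fields: {missing}"}

    # Stage 2: subjective review, reached only when the cheap checks pass.
    if llm_judge is not None:
        verdict = llm_judge(output)
        return {"pass": verdict["pass"], "stage": "llm",
                "feedback": verdict.get("feedback", "")}
    return {"pass": True, "stage": "deterministic", "feedback": ""}
```

Because malformed output never reaches stage 2, the expensive LLM evaluation is spent only on candidates that are at least structurally sound.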

Termination Conditions

Every evaluator-optimizer loop must have clear termination conditions to prevent infinite iteration:

  • Quality threshold: The evaluation score exceeds a predefined minimum (e.g., 8/10 on a rubric, or all required fields present and valid).
  • Maximum iterations: A hard cap on the number of refinement cycles (typically 2-4). Beyond this, diminishing returns set in and costs escalate.
  • Diminishing returns detection: If the evaluation score does not improve between consecutive iterations, stop early — the generator has likely reached its ceiling for the given prompt and criteria.
  • Time or budget limit: In production systems, a maximum token budget or wall-clock time limit may override quality-based conditions.

Code Example: Evaluator-Optimizer for Code Generation

import anthropic
import json

client = anthropic.Anthropic()

def llm(prompt: str, system: str = "") -> str:
    kwargs = {"model": "claude-sonnet-4-5", "max_tokens": 2048,
              "messages": [{"role": "user", "content": prompt}]}
    if system:
        kwargs["system"] = system
    return client.messages.create(**kwargs).content[0].text.strip()

# ---------- Generator ----------
def generate_code(task: str, feedback: str = "") -> str:
    """Generate or refine code based on a task description and optional feedback."""
    prompt = f"Write a Python function for the following task:\n\n{task}"
    if feedback:
        prompt += f"""\n\nYour previous attempt received the following feedback.
Please address all issues:\n\n{feedback}"""
    return llm(prompt, system="""You are an expert Python developer. Write clean,
well-documented code with type hints and docstrings. Return ONLY the code,
no explanations.""")

# ---------- Evaluator ----------
def evaluate_code(code: str, task: str) -> dict:
    """Evaluate the generated code against quality criteria."""
    prompt = f"""Evaluate the following Python code against these criteria.
For each criterion, respond with PASS or FAIL and a brief explanation.

Criteria:
1. Correctness: Does it correctly solve the task?
2. Type hints: Does it include type annotations?
3. Docstring: Does it have a clear docstring?
4. Edge cases: Does it handle edge cases (empty input, invalid types)?
5. Readability: Is the code clean and well-structured?

Task: {task}

Code:
{code}

Respond as JSON with this format:
{{"overall_pass": true/false, "score": <number 0-5>, "criteria": [...], "feedback": "..."}}"""

    raw = llm(prompt, system="You are a senior code reviewer. Be thorough but fair.")
    raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    return json.loads(raw)

# ---------- Evaluator-Optimizer Loop ----------
def generate_with_refinement(task: str, max_iterations: int = 3,
                             quality_threshold: int = 4) -> dict:
    """Run the eval-opt loop until quality is met or iterations are exhausted."""
    feedback = ""

    for iteration in range(1, max_iterations + 1):
        print(f"\n[Iteration {iteration}] Generating code...")
        code = generate_code(task, feedback)

        print(f"[Iteration {iteration}] Evaluating...")
        evaluation = evaluate_code(code, task)
        score = evaluation.get("score", 0)
        print(f"[Iteration {iteration}] Score: {score}/5 (threshold: {quality_threshold})")

        if score >= quality_threshold:
            print(f"[Iteration {iteration}] Quality threshold met!")
            return {"code": code, "iterations": iteration,
                    "final_score": score, "evaluation": evaluation}

        # Feed evaluation back for next iteration
        feedback = evaluation.get("feedback", "Please improve the code quality.")
        print(f"[Iteration {iteration}] Feedback: {feedback}")

    print("[Max iterations reached] Returning best effort.")
    return {"code": code, "iterations": max_iterations,
            "final_score": score, "evaluation": evaluation}

# Example
result = generate_with_refinement(
    "Write a function that flattens a nested list of arbitrary depth."
)
print(f"\nFinal code (after {result['iterations']} iteration(s)):")
print(result['code'])

When NOT to Use This Pattern

  • Simple, well-constrained tasks: If the generator consistently produces correct output on the first try, adding an evaluation loop is unnecessary overhead.
  • Real-time requirements: Each iteration adds latency. If response time is critical (e.g., chatbot interactions), iterative refinement may be too slow.
  • Subjective tasks without clear criteria: If you can't define what "good" looks like in measurable terms, the evaluator won't provide useful feedback.
  • Cost-sensitive workloads: Each iteration consumes additional tokens. For high-volume, low-value tasks, the cost of refinement may exceed the value of improved quality.

Exam Tip: The exam tests whether you can set appropriate termination conditions for evaluator-optimizer loops. A question might describe a system that loops indefinitely or converges too slowly — the correct answer will involve adding a max iteration cap, a quality threshold, or both. Always remember: without termination conditions, an eval-opt loop can consume unbounded tokens and time.

Key Takeaways

The evaluator-optimizer pattern creates a feedback loop: generate → evaluate → refine → repeat. It transforms single-shot generation into iterative refinement.

Evaluation quality drives loop quality. Use specific, measurable criteria. Prefer deterministic checks where possible, and use LLM evaluators for subjective criteria.

Always set termination conditions: max iterations, quality threshold, or diminishing returns detection. Never allow an eval-opt loop to run indefinitely.

Use this pattern selectively. It adds latency and cost. Reserve it for tasks where first-pass quality is genuinely insufficient and clear improvement criteria exist.