The Reflection Pattern in Practice

Module 11: Agentic System Design | Expansion Guide

The Problem

Your agent generates code. It looks good. You run it, and it crashes on line 4: the agent confidently hallucinated a library function that doesn't exist. If only it had checked its own work before declaring success.

Agents make mistakes. But they can also catch them - if you teach them to look.

Most developers build agents that generate and move on. No self-checking, no verification, no "wait, does this even make sense?" The reflection pattern adds that critical second look.

The Core Insight

Reflection is critique as a system component, not an afterthought.

Think of it like code review: you don't ship without review because you know your own biases blind you to bugs. Agents have the same problem, but they can be their own reviewer if you build in the reflection loop.

The pattern is simple: Generate → Critique → Refine → Repeat. The magic is in making critique automatic and systematic.

The Walkthrough

Basic Reflection Loop

def agent_with_reflection(task, max_passes=2):
    # Step 1: Generate initial solution
    output = agent.generate(task)

    for _ in range(max_passes):
        # Step 2: Reflect on the output
        critique = agent.critique(
            task=task,
            output=output,
            criteria=["correctness", "completeness", "edge cases"]
        )

        # Step 3: Stop as soon as the critique comes back clean
        if not critique.has_issues:
            break

        # Step 4: Refine based on critique, then loop back to re-critique
        output = agent.refine(
            original_output=output,
            critique=critique
        )

    return output

The Critique Prompt Pattern

The critique step is where the magic happens. Make it specific:

critique_prompt = f"""
You generated this code:
{generated_code}

Review it for:
1. Syntax errors (does this even run?)
2. Logical errors (does it do what was asked?)
3. Edge cases (what breaks this?)
4. Hallucinations (are you using real APIs/functions?)

For each issue found:
- Severity: critical/major/minor
- Location: where in the code
- Suggestion: how to fix

If no issues, respond with "APPROVED"
"""
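Downstream code needs that structured response in machine-readable form. A minimal parser sketch, assuming the model follows the bullet format above (the `Issue` dataclass and field names are illustrative, not a library API):

```python
from dataclasses import dataclass

@dataclass
class Issue:
    severity: str    # critical / major / minor
    location: str    # where in the code
    suggestion: str  # how to fix

def parse_critique(text: str) -> list[Issue]:
    """Collect Severity/Location/Suggestion triples from the critique.
    Returns an empty list when the model answered APPROVED."""
    if "APPROVED" in text:
        return []
    issues, fields = [], {}
    for line in text.splitlines():
        line = line.lstrip("- ").strip()
        for key in ("Severity", "Location", "Suggestion"):
            if line.startswith(key + ":"):
                fields[key.lower()] = line.split(":", 1)[1].strip()
        if len(fields) == 3:  # one complete triple -> one Issue
            issues.append(Issue(**fields))
            fields = {}
    return issues
```

Structured output like this is what lets the refine step target specific locations instead of re-generating everything.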

Multi-Pass Reflection

Different critique lenses for different passes:

Pass 1 - Correctness: Does it work? "Run this mentally. Does it produce the right output?"
Pass 2 - Completeness: Does it handle all cases? "What inputs would break this?"
Pass 3 - Quality: Is it maintainable? "Would you accept this in code review?"
Pass 4 - Safety: Can it cause harm? "What's the worst that could happen?"
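The four passes can be driven by one loop. A sketch that takes the critique and refine steps as plain callables so it stays model-agnostic (`multi_pass_reflect` and both callables are illustrative names, not an existing API):

```python
# Each pass pairs a lens with the question the critic is asked.
PASSES = [
    ("correctness", "Run this mentally. Does it produce the right output?"),
    ("completeness", "What inputs would break this?"),
    ("quality", "Would you accept this in code review?"),
    ("safety", "What's the worst that could happen?"),
]

def multi_pass_reflect(output, critique, refine):
    """Run every lens in order; refine whenever the critic finds issues.
    `critique(output, question)` returns a list of issues (empty = clean);
    `refine(output, issues)` returns an improved output."""
    for lens, question in PASSES:
        issues = critique(output, question)
        if issues:
            output = refine(output, issues)
    return output
```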

Example: Code Generation with Reflection

# Step 1: Generate
code = agent.generate("Write a function to fetch user data from API")

# Step 2: First Reflection - Correctness
critique_1 = agent.reflect(f"""
Does this code work?
{code}

Check:
- Are imports real?
- Is the API call syntax correct?
- Will this run without errors?

If no issues, respond with "APPROVED".
""")

# Step 3: Refine if needed
if "APPROVED" not in critique_1:
    code = agent.refine(code, critique_1)

# Step 4: Second Reflection - Edge Cases
critique_2 = agent.reflect(f"""
What edge cases are missing?
{code}

Consider:
- API timeout
- Malformed response
- Network errors
- Invalid user ID

If all are handled, respond with "APPROVED".
""")

# Step 5: Final refinement
if "APPROVED" not in critique_2:
    code = agent.refine(code, critique_2)

The Verification Tool Pattern

Instead of asking the agent to "imagine" if code works, give it a tool to actually test:

tools = [
    verify_syntax,        # run a linter on the code
    execute_in_sandbox,   # actually execute it in isolation
    check_imports,        # confirm every imported library exists
]
results = [tool(code) for tool in tools]

Reflection with verification is more reliable than pure critique.
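Two of those verifiers can be real with nothing but the standard library. A sketch assuming the agent emits Python: `verify_syntax` parses the code, and `check_imports` only confirms each imported top-level module is installed, not that it is used correctly (sandbox execution is deliberately omitted here):

```python
import ast
import importlib.util

def verify_syntax(code: str) -> list[str]:
    """Return a list of syntax problems (empty means the code parses)."""
    try:
        ast.parse(code)
        return []
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]

def check_imports(code: str) -> list[str]:
    """Flag imported top-level modules that are not installed."""
    issues = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # only the top-level package is checkable
            if importlib.util.find_spec(root) is None:
                issues.append(f"unknown module: {root}")
    return issues
```

Feeding these mechanical findings into the critique prompt grounds the reflection in facts the model cannot rubber-stamp away.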

The Self-Consistency Check

Generate multiple solutions, then have the agent pick the best:

# Generate 3 different solutions
solutions = [
    agent.generate(task),
    agent.generate(task),
    agent.generate(task)
]

# Agent critiques all and picks best
best = agent.select_best(f"""
Here are 3 solutions to: {task}

Solution A: {solutions[0]}
Solution B: {solutions[1]}
Solution C: {solutions[2]}

Which is best? Why? What would you improve?
""")
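A cheap mechanical pre-filter can narrow the field before the LLM judge sees it. This sketch drops candidates that don't even parse (a Python-only heuristic, and a complement to, not a replacement for, the selection prompt above; `prefilter_candidates` is an illustrative name):

```python
import ast

def _parses(code: str) -> bool:
    """True when the candidate is at least syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def prefilter_candidates(solutions: list[str]) -> list[str]:
    """Drop candidates that fail to parse before the agent judges them."""
    viable = [s for s in solutions if _parses(s)]
    return viable or solutions  # if nothing parses, let the judge see them all
```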

Failure Patterns

1. The Rubber-Stamp Reflection

Symptom: Agent critiques its own work and always says "looks good."

Fix: Make critique adversarial. Prompt: "You must find at least one issue, even if minor."

2. The Infinite Loop

Symptom: Agent keeps finding issues, refining, finding new issues, never finishing.

Fix: Set a max iteration limit (2-3 passes). After that, ship it.
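That cap is a few lines of code. A sketch with the critique and refine steps as stand-in callables (`reflect_with_cap` and `MAX_PASSES` are illustrative names):

```python
MAX_PASSES = 3  # hard ceiling: after this, ship what you have

def reflect_with_cap(output, critique, refine, max_passes=MAX_PASSES):
    """Stop at the first clean critique or at the iteration cap,
    whichever comes first; never loop forever."""
    for _ in range(max_passes):
        issues = critique(output)
        if not issues:
            break
        output = refine(output, issues)
    return output
```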

3. The Vague Critique

Symptom: Critique says "this could be better" without specifics.

Fix: Force structured output: severity, location, exact fix needed.

4. The Same-Model Blindspot

Symptom: Agent makes mistake, then approves its own mistake in reflection.

Fix: Use a different model (or different family) for critique — e.g., Sonnet 4.6 for generation, Opus 4.6 for critique, or swap to Gemini 2.5 / GPT-5 for adversarial review.

The Confidence Paradox

Agents that are confident in their output are less likely to find issues in reflection. You may need to explicitly prompt: "Be skeptical. Assume there are bugs."

Advanced Pattern: Hierarchical Reflection

For complex tasks, reflect at multiple levels:

# Level 1: Line-by-line review
for line in code.split('\n'):
    critique_line(line)

# Level 2: Function-level review
for function in extract_functions(code):
    critique_function(function)

# Level 3: Architecture review
critique_overall_design(code)
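The function-level pass needs a way to pull functions out of the source. Assuming the generated code is Python, the standard-library `ast` module handles the extraction (the per-function critique itself is left to your agent):

```python
import ast

def extract_functions(code: str) -> dict[str, str]:
    """Map each top-level function name to its source, for per-function review."""
    tree = ast.parse(code)
    return {
        node.name: ast.get_source_segment(code, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }
```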

Example: SQL Generation with Reflection

Without Reflection

query = agent.generate("Get all users who signed up last month")
# Returns: SELECT * FROM users WHERE signup_date > '2024-01-01'
# Issue: Hardcoded date, selects all columns, no indexes

With Reflection

query = agent.generate("Get all users who signed up last month")

critique = agent.reflect(f"""
Review this SQL:
{query}

Check:
1. Does it actually get "last month" (not hardcoded date)?
2. Are we selecting only needed columns?
3. Will this be slow on a large table?
4. Any SQL injection risks?
""")

# Agent finds issues, refines to:
# SELECT id, email, signup_date
# FROM users USE INDEX (idx_signup_date)
# WHERE signup_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
#   AND signup_date < CURDATE()
# (In MySQL, USE INDEX goes right after the table name, not at the end.)

Quick Reference

Basic Reflection Loop:

  1. Generate initial output
  2. Critique with specific criteria
  3. Refine based on critique
  4. Repeat 1-2 times max

Critique Prompt Checklist:

  - Name concrete criteria (syntax, logic, edge cases, hallucinated APIs)
  - Require structured findings: severity, location, suggested fix
  - Define an explicit all-clear signal (e.g., "APPROVED")
  - Make it adversarial: assume there are bugs

When to Use Reflection:

  - Generated code or queries that will actually be executed
  - High-stakes or hard-to-reverse outputs
  - Anything a human reviewer would check before shipping

Rule of Thumb:

If a human would review it before shipping, your agent should too. Reflection is automated QA.