The Problem
An agent writes code for you. It works, but it's not great. No error handling. Magic numbers scattered everywhere. Inefficient algorithm. You could ask it to revise, but you'd have to manually review and give feedback. Again. And again.
What if a second agent automatically reviewed the first agent's work, provided critique, and the first agent revised based on that feedback? A self-improving loop with no human in the middle.
Sounds perfect. Then you run it: infinite loop. The critic always finds something to improve. The two agents have been iterating for 15 minutes and have burned through $5 in API costs. The code isn't better, just different. Generator-critic loops without exit conditions become money pits.
The Core Insight
Critic agents work when you have clear quality criteria and hard exit conditions. Without both, they iterate forever or stop too early.
Think of it like peer review in academia:
- Good review: Reviewer gives specific feedback (add citations, clarify methods). Author revises. Reviewer approves or requests targeted changes. Process converges.
- Bad review: Reviewer always finds something ("this could be better"). Author revises endlessly. Paper never ships.
The pattern requires three components:
| Component | Purpose | Example |
|---|---|---|
| Generator | Creates initial output | Writes Python function |
| Critic | Reviews and scores quality | Checks for errors, style, efficiency |
| Exit Condition | Decides when to stop | Score > 8/10 OR 5 iterations |
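In code, the three components reduce to three interchangeable callables. A type sketch, assuming the signatures used later in this walkthrough:

```python
from typing import Callable, Optional

# (task, critique) -> output; critique is None on the first pass
Generator = Callable[[str, Optional[str]], str]
# (output, criteria) -> {"score": float, "feedback": str, "passes": bool}
Critic = Callable[[str, dict], dict]
# (score, iteration) -> True when the loop should stop
ExitCondition = Callable[[float, int], bool]
```

Keeping the components behind these narrow interfaces makes it easy to swap in a cheaper critic model, or a deterministic critic, without touching the loop.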
The Walkthrough
Step 1: Generator Agent
The generator creates initial output. It doesn't need to be perfect - the critic will improve it:
```python
from anthropic import Anthropic

client = Anthropic()

def generator_agent(task: str, critique: str = None) -> str:
    """Generate code. If a critique is provided, revise based on it."""
    if critique:
        prompt = f"""Revise this code based on the critique below.

Original task: {task}

Critique:
{critique}

Provide the revised code."""
    else:
        prompt = f"""Write code for this task: {task}

Focus on getting it working. Don't worry about perfection."""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# First generation
code = generator_agent("Write a function that finds the longest palindrome substring")
print(code)
```
Step 2: Critic Agent with Scoring
The critic evaluates quality and returns a numeric score plus actionable feedback:
```python
import json

def critic_agent(code: str, criteria: dict) -> dict:
    """Critique code and return score + feedback."""
    prompt = f"""You are a code critic. Evaluate this code against the criteria below.

Code:
{code}

Criteria:
{format_criteria(criteria)}

Return JSON with:
{{
    "score": 0-10 (overall quality),
    "feedback": "Specific, actionable improvements",
    "passes": true/false (meets minimum threshold)
}}

Be strict but fair. Score 10 means production-ready. Score < 7 needs revision."""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}]
    )
    # Assumes the model returns bare JSON; in practice, guard against
    # responses wrapped in markdown fences or surrounding prose.
    return json.loads(response.content[0].text)

def format_criteria(criteria: dict) -> str:
    """Format criteria for the prompt."""
    return "\n".join(f"- {k}: {v}" for k, v in criteria.items())

# Example criteria
criteria = {
    "correctness": "Algorithm produces correct results",
    "efficiency": "Time complexity is optimal",
    "readability": "Code is clear and well-commented",
    "error_handling": "Handles edge cases gracefully"
}

critique = critic_agent(code, criteria)
# Returns: {"score": 6, "feedback": "Missing edge case handling for empty strings...", "passes": false}
```
Step 3: Iteration Loop with Exit Conditions
Loop between generator and critic until exit condition is met:
```python
def generator_critic_loop(
    task: str,
    criteria: dict,
    max_iterations: int = 5,
    target_score: float = 8.0
) -> dict:
    """Run the generator-critic loop until a quality threshold or max iterations."""
    code = None
    history = []

    for iteration in range(max_iterations):
        print(f"\n🔄 Iteration {iteration + 1}/{max_iterations}")

        # Generate (or revise)
        if iteration == 0:
            code = generator_agent(task)
        else:
            critique_text = history[-1]["feedback"]
            code = generator_agent(task, critique=critique_text)

        # Critique
        critique = critic_agent(code, criteria)
        history.append({
            "iteration": iteration + 1,
            "code": code,
            "score": critique["score"],
            "feedback": critique["feedback"]
        })
        print(f"   Score: {critique['score']}/10")
        print(f"   Feedback: {critique['feedback'][:100]}...")

        # Exit condition 1: quality threshold reached
        if critique["score"] >= target_score:
            print(f"\n✅ Target score reached: {critique['score']}/10")
            break

        # Exit condition 2: iteration budget exhausted
        if iteration == max_iterations - 1:
            print(f"\n⚠️ Max iterations reached. Best score: {critique['score']}/10")
        else:
            print("   → Revising based on feedback...")

    return {
        "final_code": code,
        "final_score": history[-1]["score"],
        "iterations": len(history),
        "history": history
    }

# Run the loop
result = generator_critic_loop(
    task="Write a function that finds the longest palindrome substring",
    criteria=criteria,
    max_iterations=5,
    target_score=8.0
)

print("\n🎯 Final Result:")
print(f"   Score: {result['final_score']}/10")
print(f"   Iterations: {result['iterations']}")
print(f"\n{result['final_code']}")
```
Sample output:
```
🔄 Iteration 1/5
   Score: 5/10
   Feedback: Missing edge case for empty strings. No error handling...
   → Revising based on feedback...

🔄 Iteration 2/5
   Score: 7/10
   Feedback: Good error handling added. Consider optimizing with dynamic programming...
   → Revising based on feedback...

🔄 Iteration 3/5
   Score: 8.5/10
   Feedback: Excellent. Minor: add type hints for clarity

✅ Target score reached: 8.5/10

🎯 Final Result:
   Score: 8.5/10
   Iterations: 3
```
Two Exit Conditions, Always
Always use BOTH a quality threshold AND a max iteration limit. Quality ensures good output. Max iterations prevents infinite loops. Never have just one.
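The rule fits in one predicate. A minimal sketch (the function name and defaults here are illustrative, not part of the loop code above):

```python
def should_stop(score: float, iteration: int,
                target_score: float = 8.0, max_iterations: int = 5) -> bool:
    """Stop when EITHER the quality bar is met OR the iteration budget runs out."""
    return score >= target_score or iteration >= max_iterations - 1
```

Either condition alone fails: a quality bar with no cap loops forever on an unsatisfiable critic; a cap with no quality bar ships mediocre output that a single extra pass would have fixed.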
Advanced: Diminishing Returns Detection
Sometimes the score plateaus - iterations aren't helping. Detect this and exit early:
```python
def has_plateau(history: list, window: int = 3, threshold: float = 0.2) -> bool:
    """Check if score improvement has stalled."""
    if len(history) < window:
        return False
    recent_scores = [h["score"] for h in history[-window:]]
    score_range = max(recent_scores) - min(recent_scores)
    return score_range < threshold  # less than a 0.2-point change

# In the loop, after appending to history:
if has_plateau(history):
    print("\nℹ️ Plateau detected. Scores not improving. Stopping.")
    break
```
When Critic Loops Work
Good use cases:
- Code quality improvement: Clear criteria (tests pass, no linting errors, coverage > 80%)
- Content refinement: Writing with measurable quality (reading level, clarity score)
- Design iteration: With quantifiable goals (accessibility score, load time)
- Test generation: Coverage percentage as quality metric
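When criteria are this measurable, the critic doesn't even have to be an LLM. A sketch of a deterministic critic that maps coverage and lint counts into the same score/passes shape used above (the scoring formula is an invented example, not a standard):

```python
def metric_critic(coverage: float, lint_errors: int,
                  target_coverage: float = 80.0) -> dict:
    """Deterministic critic: derive a 0-10 score from measurable metrics."""
    score = 10.0
    if coverage < target_coverage:
        # Each 10 points of missing coverage costs one point
        score -= (target_coverage - coverage) / 10
    # Each lint error costs a point, capped at 5
    score -= min(lint_errors, 5)
    score = max(score, 0.0)
    return {"score": score,
            "feedback": f"coverage={coverage}%, lint_errors={lint_errors}",
            "passes": score >= 7}
```

A deterministic critic is free to run, can't hallucinate feedback, and never drifts, so prefer it whenever your criteria can be computed.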
Bad use cases (don't use critic loops):
- Subjective criteria ("make it better")
- No clear threshold (when is "good enough"?)
- Tasks that need human judgment
- When generator and critic share the same knowledge (the critic can't catch what the generator couldn't already see, so feedback stops adding information)
Failure Patterns
1. Infinite Loop
Symptom: Loop runs for 20 iterations, burning money. No convergence.
Fix: Always set max_iterations. Start with 3-5. If you hit the limit regularly, your criteria are too strict.
2. Oscillation
Symptom: Generator fixes issue A, critic complains about B. Generator fixes B, breaks A. Repeat.
Fix: Track what was fixed. Tell generator not to regress:
```python
# In the generator prompt:
"Previous iteration fixed: {fixed_issues}. Maintain those fixes while addressing: {new_critique}"
```
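One way to sketch that prompt assembly (the `fixed_issues` list and the helper name are assumptions, not part of the loop code above):

```python
def build_revision_prompt(task: str, critique: str, fixed_issues: list) -> str:
    """Compose a revision prompt that pins previously fixed issues."""
    pinned = "\n".join(f"- {issue}" for issue in fixed_issues) or "- (none yet)"
    return (
        f"Original task: {task}\n\n"
        f"Previous iterations fixed:\n{pinned}\n\n"
        f"Maintain those fixes while addressing this critique:\n{critique}\n"
    )
```

The loop would append each addressed item from the critic's feedback to `fixed_issues` before the next generation, so fixes compound instead of oscillating.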
3. Trivial Changes
Symptom: Critic nitpicks style issues. Generator changes variable names. Nothing substantive improves.
Fix: Prioritize criteria. Focus on correctness, then efficiency, then style:
```python
criteria = {
    "correctness": {"weight": 10, "description": "Algorithm is correct"},
    "efficiency": {"weight": 5, "description": "Optimal time complexity"},
    "style": {"weight": 2, "description": "Follows conventions"}
}
```
4. Cost Explosion
Symptom: Each iteration costs $0.50. Five iterations = $2.50 for one function.
Fix: Use cheaper models for critic. Generator uses Sonnet, critic uses Haiku:
```python
def critic_agent(code: str, criteria: dict) -> dict:
    prompt = ...  # same critique prompt as in Step 2
    response = client.messages.create(
        model="claude-haiku-4-5-20250929",  # cheaper, faster for critique
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    # ... parse JSON as before
```
The Perfectionism Trap
AI critics can always find something to improve. Don't chase perfection. Set "good enough" thresholds based on context. Internal tool? Score 7 is fine. Public API? Demand 9. Match rigor to risk.
Real-World Example: Test Generation Loop
```python
def generate_tests_with_critic(function_code: str) -> str:
    """Generate comprehensive tests using a generator-critic loop."""
    criteria = {
        "coverage": "Tests cover all code paths",
        "edge_cases": "Tests include boundary conditions, null, empty, large inputs",
        "assertions": "Assertions check behavior, not just that code runs",
        "readability": "Test names clearly describe what's being tested"
    }
    result = generator_critic_loop(
        task=f"Write pytest tests for this function:\n\n{function_code}",
        criteria=criteria,
        max_iterations=4,
        target_score=8.5
    )
    return result["final_code"]

# Usage
function = """
def longest_palindrome(s: str) -> str:
    # ... implementation ...
"""
tests = generate_tests_with_critic(function)
# Returns a high-quality test suite after 2-3 iterations
```
Quick Reference
Critic Loop Checklist:
- Define clear criteria: What makes output good?
- Make criteria measurable: Use numeric scores
- Set target score: When is "good enough"?
- Set max iterations: Usually 3-5
- Detect plateaus: Stop if no improvement
Implementation Pattern:
```python
for i in range(max_iterations):
    output = generator(task, previous_critique)
    critique = critic(output, criteria)
    if critique.score >= target_score:
        break  # good enough
    if has_plateau(history):
        break  # not improving
return output
```
Exit Conditions (use multiple):
- Score threshold reached (e.g., score >= 8)
- Max iterations hit (e.g., 5 loops)
- Plateau detected (last 3 scores within 0.2)
- Cost limit exceeded (track API spend)
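The cost-limit condition can be tracked from the token counts the Anthropic API returns on each response (`response.usage.input_tokens` / `response.usage.output_tokens`). The per-million-token prices below are placeholder assumptions; check current pricing before relying on them:

```python
class CostTracker:
    """Accumulate approximate API spend and flag when a budget is exceeded."""

    def __init__(self, budget_usd: float,
                 input_price_per_mtok: float = 3.0,    # assumed price, not authoritative
                 output_price_per_mtok: float = 15.0):  # assumed price, not authoritative
        self.budget = budget_usd
        self.spent = 0.0
        self._in_rate = input_price_per_mtok / 1_000_000
        self._out_rate = output_price_per_mtok / 1_000_000

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one API call's token usage to the running total."""
        self.spent += input_tokens * self._in_rate + output_tokens * self._out_rate

    def exceeded(self) -> bool:
        return self.spent >= self.budget
```

In the loop, call `tracker.record(response.usage.input_tokens, response.usage.output_tokens)` after each API call and `break` when `tracker.exceeded()` returns true.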
Cost Optimization:
- Use cheaper model for critic (Haiku vs Sonnet)
- Limit iterations to 3-5
- Cache repeated critiques
- Start with single-agent, add critic only when needed