The Problem
Your test suite is green. 100% passing. You deploy with confidence. Then production breaks and you realize: the tests were passing but they weren't actually testing anything. AI generated tests that looked right, ran successfully, but failed to catch the bug that just took down your service.
This is the false positive problem: tests that pass when they should fail, or test the wrong thing entirely.
AI-generated tests are particularly prone to this. They'll assert on values that are always true, mock everything so failures can't happen, or test implementation details that don't matter. The tests give false confidence.
The Core Insight
A test is only valuable if it can fail. If it can't detect broken code, it's worse than no test at all - it's a lie.
Before trusting any AI-generated test, you need to validate that: the test actually exercises the code, assertions are specific enough to catch bugs, and the test would fail if the code were broken. This requires intentional verification patterns.
The Walkthrough
Validation Pattern 1: The Mutation Test
The gold standard for test validation: break the code on purpose, verify the test catches it.
```python
# Original implementation
def calculate_total(items, tax_rate):
    """Calculate total price with tax"""
    subtotal = sum(item.price for item in items)
    tax = subtotal * tax_rate
    return subtotal + tax

# AI-generated test
def test_calculate_total():
    items = [Item(price=10), Item(price=20)]
    total = calculate_total(items, 0.1)
    assert total == 33  # 30 + 3 tax = 33
```
Validation: Break the code in different ways and ensure the test fails each time:
```python
# Mutation 1: Remove tax calculation
def calculate_total(items, tax_rate):
    subtotal = sum(item.price for item in items)
    return subtotal  # BUG: forgot to add tax

# Run test - does it fail? YES ✓

# Mutation 2: Double the tax
def calculate_total(items, tax_rate):
    subtotal = sum(item.price for item in items)
    tax = subtotal * tax_rate * 2  # BUG: double tax
    return subtotal + tax

# Run test - does it fail? YES ✓

# Mutation 3: Wrong operator
def calculate_total(items, tax_rate):
    subtotal = sum(item.price for item in items)
    tax = subtotal * tax_rate
    return subtotal - tax  # BUG: subtraction instead of addition

# Run test - does it fail? YES ✓
```
If the test fails for all mutations, it's a good test. If it still passes with broken code, the test is weak.
Automated Mutation Testing
Tools like mutmut (Python), Stryker (JavaScript), or PITest (Java) automatically mutate your code and check if tests catch the mutations. Run these on AI-generated test suites.
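Under the hood, these tools all run the same loop: apply one small change to the source, re-run the tests, and record whether anything fails. A toy sketch of that loop, mutating only `+` into `-` via Python's `ast` module (the names and structure here are illustrative, not mutmut's internals):

```python
import ast

class SwapAddSub(ast.NodeTransformer):
    """Toy mutation: turn every + in the source into -."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

def mutant_survives(source, test):
    """Return True if `test` still passes against the mutated source.
    A surviving mutant means the test is too weak to notice the bug."""
    tree = SwapAddSub().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    try:
        test(namespace)
        return True    # mutant survived: weak test
    except AssertionError:
        return False   # mutant killed: the test has teeth

SOURCE = "def total(a, b):\n    return a + b\n"

def strong_test(ns):
    assert ns["total"](2, 3) == 5

def weak_test(ns):
    assert ns["total"](2, 3) is not None

print(mutant_survives(SOURCE, strong_test))  # False: mutation caught
print(mutant_survives(SOURCE, weak_test))    # True: mutation missed
```

Real tools generate many mutation operators (swapped comparisons, off-by-one constants, deleted statements) and report a kill rate; the principle is exactly this loop.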
Validation Pattern 2: The Assertion Strength Check
Weak assertions are the most common test smell. Check every assertion for specificity:
```python
# WEAK: AI often generates these
def test_get_user():
    user = get_user(123)
    assert user  # BAD: Just checks truthiness

# Better, but still weak
def test_get_user():
    user = get_user(123)
    assert user is not None  # BAD: Doesn't verify it's the correct user
    assert user.id  # BAD: Just checks id exists

# STRONG: Specific assertions
def test_get_user():
    user = get_user(123)
    assert user.id == 123
    assert user.email == "expected@example.com"
    assert user.name == "Expected Name"
    assert isinstance(user.created_at, datetime)
```
Validation checklist for assertions:
- Does it check the exact expected value, not just "something"?
- Does it verify type, not just truthiness?
- Does it check all important fields, not just one?
- Would it fail if the function returned nonsense?
Validation Pattern 3: The "Always Passes" Check
Some AI-generated tests are tautologies - they can't fail because they test nothing:
```python
# USELESS TEST: Always passes
def test_function_returns_something():
    result = my_function()
    assert result is not None  # This passes even if result is wrong

# USELESS TEST: Tests the test, not the code
def test_list_has_items():
    items = [1, 2, 3]  # Hardcoded in test
    assert len(items) == 3  # Of course it does!

# USELESS TEST: Mocked away
def test_api_call(mocker):
    mocker.patch('api.call', return_value={'status': 'ok'})
    result = process_api_call()
    assert result['status'] == 'ok'  # We mocked it to return this!
```
To validate, ask: "Could this test pass with completely broken production code?" If yes, delete it.
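That question can even be asked mechanically: swap the function under test for one that returns a meaningless object, re-run the test, and see whether it still passes. A minimal sketch, monkeypatching by hand (the module and function names here are illustrative):

```python
import types

def passes_with_garbage(test_fn, module, name):
    """Return True if test_fn still PASSES after the real function is
    swapped for one returning a meaningless object (a bad sign)."""
    original = getattr(module, name)
    setattr(module, name, lambda *args, **kwargs: object())
    try:
        test_fn()
        return True   # test passed with broken code: delete the test
    except Exception:
        return False  # test noticed something was wrong
    finally:
        setattr(module, name, original)

# Illustrative stand-in for a real module
mod = types.SimpleNamespace(get_user=lambda uid: {"id": uid})

def weak_test():
    assert mod.get_user(123) is not None   # truthiness only

def strong_test():
    assert mod.get_user(123)["id"] == 123  # specific value

print(passes_with_garbage(weak_test, mod, "get_user"))    # True
print(passes_with_garbage(strong_test, mod, "get_user"))  # False
```

Any test that returns True here could pass with completely broken production code, which is exactly the failure mode this pattern hunts for.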
Validation Pattern 4: The Negative Case Test
AI often forgets to test error paths. For every function, there should be at least one test that expects failure:
```python
# AI might generate only this:
def test_divide():
    result = divide(10, 2)
    assert result == 5

# But you need these too:
def test_divide_by_zero():
    with pytest.raises(ValueError):
        divide(10, 0)

def test_divide_with_non_numbers():
    with pytest.raises(TypeError):
        divide("10", "2")
```
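The negative tests above encode a contract for `divide`: reject non-numbers with TypeError, reject zero with ValueError. A sketch of an implementation that satisfies all three tests (the exact error types are an assumption the tests make explicit):

```python
def divide(a, b):
    """Divide a by b, with explicit error paths for the negative tests."""
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        raise TypeError("divide() expects numeric arguments")
    if b == 0:
        raise ValueError("cannot divide by zero")
    return a / b
```

Writing the error paths down like this is useful in itself: if you can't say which exception a function should raise, the AI certainly couldn't have tested it.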
Validation: For every success test, check if there's a corresponding failure test.
Common Bad Test Patterns
Pattern: The Brittle Mock
```python
# BAD: Mocks make the test pass but test nothing
def test_send_email(mocker):
    mock_smtp = mocker.patch('smtplib.SMTP')
    send_email("test@example.com", "Subject", "Body")
    mock_smtp.assert_called_once()  # Called, but did it work?

# BETTER: Test integration or use fakes
def test_send_email(fake_smtp):
    send_email("test@example.com", "Subject", "Body")
    sent = fake_smtp.get_sent_messages()
    assert len(sent) == 1
    assert sent[0].to == "test@example.com"
    assert sent[0].subject == "Subject"
```
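The `fake_smtp` fixture has to come from somewhere. A minimal sketch of such a fake (class and method names are illustrative; it assumes `send_email` accepts the client via dependency injection rather than constructing `smtplib.SMTP` itself):

```python
from dataclasses import dataclass

@dataclass
class SentMessage:
    to: str
    subject: str
    body: str

class FakeSMTP:
    """In-memory stand-in for an SMTP client: records messages instead
    of sending them, so tests can inspect real behavior."""
    def __init__(self):
        self._sent = []

    def send(self, to, subject, body):
        self._sent.append(SentMessage(to, subject, body))

    def get_sent_messages(self):
        return list(self._sent)

def send_email(to, subject, body, smtp):
    # Taking the client as a parameter lets tests pass the fake
    # where production passes the real SMTP client.
    smtp.send(to, subject, body)

fake = FakeSMTP()
send_email("test@example.com", "Subject", "Body", smtp=fake)
print(fake.get_sent_messages()[0].to)  # test@example.com
```

Unlike the mock, the fake exercises the real call path: if `send_email` mangled the recipient or dropped the body, the assertions on `get_sent_messages()` would catch it.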
Pattern: The Redundant Test
```python
# These three tests all cover the same scenario:
def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_works():
    assert add(5, 7) == 12

def test_addition():
    assert add(1, 1) == 2

# Replace with one parameterized test:
@pytest.mark.parametrize("a,b,expected", [
    (2, 3, 5),
    (5, 7, 12),
    (-1, 1, 0),  # Also test negative
    (0, 0, 0),   # And zero
])
def test_add(a, b, expected):
    assert add(a, b) == expected
```
Pattern: The Implementation Test
```python
# BAD: Tests how the code works, not what it does
def test_cache_uses_redis():
    cache = Cache()
    cache.set("key", "value")
    assert cache._redis.get("cache:key") == "value"  # Private!

# GOOD: Tests behavior from the public API
def test_cache_persists_values():
    cache = Cache()
    cache.set("key", "value")
    assert cache.get("key") == "value"  # Would work with any backend
```
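To see why the behavioral test is more durable, here is a sketch of a `Cache` with a swappable backend (a plain dict by default; the Redis wiring is omitted as an assumption). The behavior test passes unchanged no matter which backend is plugged in, while the implementation test breaks the moment `_redis` is renamed or replaced:

```python
class Cache:
    """Minimal cache with a pluggable key-value backend."""
    def __init__(self, backend=None):
        # Any dict-like object works: a dict, a Redis wrapper, etc.
        # The public API below never exposes it.
        self._store = backend if backend is not None else {}

    def set(self, key, value):
        self._store[f"cache:{key}"] = value

    def get(self, key, default=None):
        return self._store.get(f"cache:{key}", default)

cache = Cache()
cache.set("key", "value")
print(cache.get("key"))      # value
print(cache.get("missing"))  # None
```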
Automated Validation Workflow
Add these checks to your test pipeline:
```yaml
# .github/workflows/test-quality.yml
name: Test Quality Checks
on: [pull_request]
jobs:
  validate-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Run mutation testing
        run: |
          pip install mutmut
          # mutmut exits non-zero when mutants survive, failing the step
          mutmut run
          mutmut results
      - name: Check assertion strength
        run: |
          # Flag weak assertions; fail the step if any are found
          if grep -r "assert.*is not None$" tests/; then exit 1; fi
          if grep -r "assert result$" tests/; then exit 1; fi
      - name: Require negative tests
        run: |
          # Ensure every test file has at least one raises/throws test
          python scripts/check_negative_tests.py
```
Don't Skip Manual Review
Automated checks catch common issues but not all. Always manually review AI-generated tests. Ask: "If I were an attacker trying to break this code, would this test catch me?"
The Nuclear Option: Delete and Rewrite
Sometimes AI-generated tests are so bad they're not worth fixing. Delete them and write from scratch if:
- More than 50% of assertions are weak
- No negative test cases at all
- Heavy mocking that tests mocks, not code
- Tests are all variations of the same scenario
- You can't explain what the test actually validates
It's faster to rewrite than to fix fundamentally flawed tests.
Quick Reference
Test Validation Checklist:
- Mutation test: Break code, verify test fails
- Assertion check: Are assertions specific?
- Negative cases: Does it test error paths?
- Mock audit: Could test pass with broken real code?
- Redundancy check: Are tests meaningfully different?
- Public API test: Does it avoid testing internals?
Red Flags (Delete These Tests):
- assert result - No specific value check
- assert result is not None - Truthy check only
- Heavy mocking that eliminates failure paths
- Test creates data then asserts on that same data
- Tests private methods or attributes
- No docstring explaining what scenario is tested
Mutation Testing Commands:
```bash
# Python
pip install mutmut
mutmut run
mutmut results

# JavaScript
npm install --save-dev @stryker-mutator/core
npx stryker run

# Java
mvn org.pitest:pitest-maven:mutationCoverage
```
Manual Validation Questions:
- If I comment out a line of production code, does this test fail?
- If I swap return values, does this test catch it?
- Can I explain this test to a junior dev in one sentence?
- Would this test prevent a real bug I've seen before?