When AI Tests Are Wrong: Validation Patterns

Module 03: TDD with AI | Expansion Guide

The Problem

Your test suite is green. 100% passing. You deploy with confidence. Then production breaks and you realize: the tests were passing but they weren't actually testing anything. AI generated tests that looked right, ran successfully, but failed to catch the bug that just took down your service.

This is the false positive problem: tests that pass when they should fail, or that test the wrong thing entirely.

AI-generated tests are particularly prone to this. They'll assert on values that are always true, mock everything so failures can't happen, or test implementation details that don't matter. The tests give false confidence.

The Core Insight

A test is only valuable if it can fail. If it can't detect broken code, it's worse than no test at all - it's a lie.

Before trusting any AI-generated test, you need to validate three things: the test actually exercises the code, the assertions are specific enough to catch bugs, and the test would fail if the code were broken. This requires intentional verification patterns.

The Walkthrough

Validation Pattern 1: The Mutation Test

The gold standard for test validation: break the code on purpose, verify the test catches it.

# Original implementation
def calculate_total(items, tax_rate):
    """Calculate total price with tax"""
    subtotal = sum(item.price for item in items)
    tax = subtotal * tax_rate
    return subtotal + tax

# AI-generated test
def test_calculate_total():
    items = [Item(price=10), Item(price=20)]
    total = calculate_total(items, 0.1)
    assert total == 33  # 30 + 3 tax = 33

Validation: Break the code in different ways and confirm the test fails each time:

# Mutation 1: Remove tax calculation
def calculate_total(items, tax_rate):
    subtotal = sum(item.price for item in items)
    return subtotal  # BUG: forgot to add tax

# Run test - does it fail? YES ✓

# Mutation 2: Double the tax
def calculate_total(items, tax_rate):
    subtotal = sum(item.price for item in items)
    tax = subtotal * tax_rate * 2  # BUG: double tax
    return subtotal + tax

# Run test - does it fail? YES ✓

# Mutation 3: Wrong operator
def calculate_total(items, tax_rate):
    subtotal = sum(item.price for item in items)
    tax = subtotal * tax_rate
    return subtotal - tax  # BUG: subtraction instead of addition

# Run test - does it fail? YES ✓

If the test fails for all mutations, it's a good test. If it still passes with broken code, the test is weak.

Automated Mutation Testing

Tools like mutmut (Python), Stryker (JavaScript), or PITest (Java) automatically mutate your code and check if tests catch the mutations. Run these on AI-generated test suites.

Validation Pattern 2: The Assertion Strength Check

Weak assertions are among the most common smells in AI-generated tests. Check every assertion for specificity:

# WEAK: AI often generates these
def test_get_user():
    user = get_user(123)
    assert user  # BAD: Just checks truthiness

# Better, but still weak
def test_get_user():
    user = get_user(123)
    assert user is not None  # BAD: Doesn't verify it's correct user
    assert user.id  # BAD: Just checks id exists

# STRONG: Specific assertions
def test_get_user():
    user = get_user(123)
    assert user.id == 123
    assert user.email == "expected@example.com"
    assert user.name == "Expected Name"
    assert isinstance(user.created_at, datetime)

Validation checklist for assertions:

  1. Does each assertion check an exact expected value, not just truthiness or existence?
  2. Does it tie the result back to the specific input (e.g., the id that was requested)?
  3. Would it fail if the function returned a plausible but wrong value?
  4. Are types checked where they matter (e.g., created_at is a datetime)?

Validation Pattern 3: The "Always Passes" Check

Some AI-generated tests are tautologies - they can't fail because they test nothing:

# USELESS TEST: Always passes
def test_function_returns_something():
    result = my_function()
    assert result is not None  # This passes even if result is wrong

# USELESS TEST: Tests the test, not the code
def test_list_has_items():
    items = [1, 2, 3]  # Hardcoded in test
    assert len(items) == 3  # Of course it does!

# USELESS TEST: Mocked away
def test_api_call(mocker):
    mocker.patch('api.call', return_value={'status': 'ok'})
    result = process_api_call()
    assert result['status'] == 'ok'  # We mocked it to return this!

To validate, ask: "Could this test pass with completely broken production code?" If yes, delete it.
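
For the mocked-away example above, one fix is to stub only the external boundary and assert on the work your own code does with the response. A minimal sketch, assuming process_api_call reads the mocked response and reshapes it (the field names here are illustrative, not from the original example):

# Stub only the boundary; assert on the transformation our code performs
def test_process_api_call_maps_response(mocker):
    mocker.patch('api.call', return_value={'status': 'ok', 'user_id': 42})
    result = process_api_call()
    # Fails if process_api_call stops reading or reshaping the response,
    # so the test can still catch broken production code
    assert result == {'succeeded': True, 'user_id': 42}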

Validation Pattern 4: The Negative Case Test

AI often forgets to test error paths. For every function, there should be at least one test that expects failure:

# AI might generate only this:
def test_divide():
    result = divide(10, 2)
    assert result == 5

# But you need this too:
def test_divide_by_zero():
    with pytest.raises(ValueError):
        divide(10, 0)

def test_divide_with_non_numbers():
    with pytest.raises(TypeError):
        divide("10", "2")

Validation: For every success test, check if there's a corresponding failure test.

Common Bad Test Patterns

Pattern: The Brittle Mock

# BAD: Mocks make test pass but test nothing
def test_send_email(mocker):
    mock_smtp = mocker.patch('smtplib.SMTP')
    send_email("test@example.com", "Subject", "Body")
    mock_smtp.assert_called_once()  # Called, but did it work?

# BETTER: Test integration or use fakes
def test_send_email(fake_smtp):
    send_email("test@example.com", "Subject", "Body")
    sent = fake_smtp.get_sent_messages()
    assert len(sent) == 1
    assert sent[0].to == "test@example.com"
    assert sent[0].subject == "Subject"
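
The fake_smtp fixture above isn't defined in this guide. Here is a minimal sketch of what it might look like, assuming send_email obtains its SMTP client through a hypothetical mailer.get_smtp_client() hook (all names are illustrative):

from dataclasses import dataclass

import pytest

@dataclass
class SentMessage:
    to: str
    subject: str
    body: str

class FakeSMTP:
    """In-memory stand-in that records messages instead of sending them."""
    def __init__(self):
        self._sent = []

    def send(self, to, subject, body):
        self._sent.append(SentMessage(to=to, subject=subject, body=body))

    def get_sent_messages(self):
        return list(self._sent)

@pytest.fixture
def fake_smtp(monkeypatch):
    fake = FakeSMTP()
    # Assumption: production code resolves its client via mailer.get_smtp_client()
    monkeypatch.setattr("mailer.get_smtp_client", lambda: fake)
    return fake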

Pattern: The Redundant Test

# These three tests all check the same behavior:
def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_works():
    assert add(5, 7) == 12

def test_addition():
    assert add(1, 1) == 2

# Replace with one parameterized test:
@pytest.mark.parametrize("a,b,expected", [
    (2, 3, 5),
    (5, 7, 12),
    (-1, 1, 0),  # Also test negative
    (0, 0, 0),   # And zero
])
def test_add(a, b, expected):
    assert add(a, b) == expected

Pattern: The Implementation Test

# BAD: Tests how code works, not what it does
def test_cache_uses_redis():
    cache = Cache()
    cache.set("key", "value")
    assert cache._redis.get("cache:key") == "value"  # Private!

# GOOD: Tests behavior from public API
def test_cache_persists_values():
    cache = Cache()
    cache.set("key", "value")
    assert cache.get("key") == "value"  # Would work with any backend

Automated Validation Workflow

Add these checks to your test pipeline:

# .github/workflows/test-quality.yml
name: Test Quality Checks

on: [pull_request]

jobs:
  validate-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run mutation testing
        run: |
          pip install mutmut
          mutmut run
          # Fail if mutation score < 80%
          mutmut results | grep "^mutation score" | \
            awk '{if ($3 < 80) exit 1}'

      - name: Check assertion strength
        run: |
          # Flag weak assertions; fail the step only if any are found
          if grep -rn "assert.*is not None$" tests/; then exit 1; fi
          if grep -rn "assert result$" tests/; then exit 1; fi

      - name: Require negative tests
        run: |
          # Ensure every test file has at least one raises/throws test (see sketch below)
          python scripts/check_negative_tests.py
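
The scripts/check_negative_tests.py referenced above isn't included in this module. A minimal sketch of what it could do, assuming pytest conventions (pytest.raises as the marker of a negative test) and a tests/ directory at the repo root:

"""Fail CI if any test file lacks a negative (error-path) test."""
import re
import sys
from pathlib import Path

# Treat pytest.raises or unittest's assertRaises as evidence of a negative test
NEGATIVE_MARKER = re.compile(r"pytest\.raises|assertRaises")

def main() -> int:
    missing = [
        path for path in Path("tests").rglob("test_*.py")
        if not NEGATIVE_MARKER.search(path.read_text())
    ]
    if missing:
        print("Test files with no error-path test:")
        for path in missing:
            print(f"  {path}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())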

Don't Skip Manual Review

Automated checks catch common issues but not all. Always manually review AI-generated tests. Ask: "If I were an attacker trying to break this code, would this test catch me?"

The Nuclear Option: Delete and Rewrite

Sometimes AI-generated tests are so bad they're not worth fixing. Delete them and write from scratch if:

  - Nearly every assertion only checks truthiness or existence
  - Everything meaningful is mocked, so no production code actually runs
  - The tests are coupled to private implementation details throughout
  - The suite still passes when you deliberately break the code

It's faster to rewrite than to fix fundamentally flawed tests.

Quick Reference

Test Validation Checklist:

  1. Mutation test: Break code, verify test fails
  2. Assertion check: Are assertions specific?
  3. Negative cases: Does it test error paths?
  4. Mock audit: Could test pass with broken real code?
  5. Redundancy check: Are tests meaningfully different?
  6. Public API test: Does it avoid testing internals?

Red Flags (Delete These Tests):

  - Assertions that only check truthiness or "is not None"
  - Tests that assert on values the test itself hardcoded or mocked
  - Tests that reach into private attributes instead of the public API
  - Tests that keep passing when you deliberately break the code under test

Mutation Testing Commands:

# Python
pip install mutmut
mutmut run
mutmut results

# JavaScript
npm install --save-dev @stryker-mutator/core
npx stryker run

# Java
mvn org.pitest:pitest-maven:mutationCoverage

Manual Validation Questions:

  - Could this test pass with completely broken production code?
  - If I were an attacker trying to break this code, would this test catch me?
  - Does the test exercise real code, or only mocks and hardcoded values?
  - Would a plausible but wrong return value slip past these assertions?