Cost Envelopes: The Pattern That Saved Us $40k

The Problem

Pick the version that's yours.

TEAM: You shipped an agent doing PR review across your dev team. $0.30/review in QA. The team was happy, finance was happy, you went home Friday. Tuesday morning: $11,400 on your Anthropic key in 36 hours from one tool-call loop.

SOLO: You shipped one ticket-triage agent for your two-person startup. $0.30/ticket in testing. You felt good about the math. Tuesday morning: $4,800 from a single ticket thread the agent kept “clarifying” overnight.

BOT: You shipped customer-facing AI completing invoice line items. $0.02/completion in QA. You shipped to all 12,000 users on Monday. Tuesday morning: a Substack post brought 200 new signups, three of them used your AI as a personal assistant, and your monthly margin is now -180%.

Then the invoice arrived. Three modes, one failure: cost is a runtime constraint, and nothing was enforcing it.

The Core Insight

Cost is a runtime constraint, not a model-pick decision. The cheapest agent is the one with a hard ceiling.

Most teams treat cost as a model-selection problem (“Sonnet vs Opus”) or a context-shaping problem (“trim the prompt”). Both matter. Neither saves you when an agent loops.

The pattern that saves you is a cost envelope attached to every task: a hard token budget, a hard wall-clock budget, and an attribution tag. When the agent hits either limit, the harness kills it — before the model gets a chance to reason about whether it should keep going. The whole point is to make cost discipline a non-negotiable runtime check, not a polite suggestion the agent might honour if it remembers.

// across actors

TEAM: Per-engineer Copilot budgets + alerts when one developer's spend doubles week-over-week.

SOLO: Per-task envelope on your agent runs ($X for code review, $Y for triage) + a kill-switch when a single run exceeds 5× the historical median.

BOT: Customer-facing AI feature with a per-customer cost ceiling so one power user can't bankrupt the unit economics.
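The SOLO kill-switch can be sketched in a few lines. This is a minimal illustration, not a finished implementation: the function names, the use of cents as the unit, and the choice to return false when there is no history are all assumptions made for the example.

```javascript
// Median of historical per-run costs (in cents). Returns null with no history.
function median(values) {
    if (values.length === 0) return null;
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2;
}

// True when the current run's spend exceeds `multiplier` times the
// historical median for this task type. With no history there is no
// baseline, so this check never fires on its own; the fixed per-task
// envelope has to cover that case.
function exceedsMedianKillSwitch(currentSpend, history, multiplier = 5) {
    const baseline = median(history);
    if (baseline === null) return false;
    return currentSpend > multiplier * baseline;
}
```

The median is used rather than the mean so that one earlier runaway run does not raise the bar for the next one.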

The Walkthrough

1. Define envelopes per task type

Three tiers cover most workloads. Pick the smallest tier that's plausibly enough; don't default to large. The token cap is on the sum of input + output across the whole task (every retry, every tool round-trip, every reflection step). Wall-clock is from task-start to forced-stop; the agent doesn't get to negotiate.

Tier	Token Cap	Wall-Clock Cap	Use For
Small	50k	90s	Bot replies, lint fixes, single-file suggestions, slash-command handlers
Medium	500k	10min	Code reviews, scoped refactors, PR descriptions, doc rewrites
Large	5M	1hr	Migrations, audits, repo-wide analyses, deep multi-file refactors

If a task wants more than Large, that's not a bigger envelope — that's a different shape of work. Break it up, hand each piece its own envelope, aggregate the results.
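The tier table translates directly into configuration. The caps below are copied from the table; the constant names, the task-type keys, and the envelopeFor helper are hypothetical and only there to show the shape.

```javascript
// Caps copied from the tier table; wall-clock values converted to milliseconds.
const ENVELOPE_TIERS = Object.freeze({
    small:  { tokenCap: 50_000,    wallClockMs: 90 * 1000 },
    medium: { tokenCap: 500_000,   wallClockMs: 10 * 60 * 1000 },
    large:  { tokenCap: 5_000_000, wallClockMs: 60 * 60 * 1000 },
});

// Hypothetical mapping from task type to tier. Every task type must be
// listed here; there is deliberately no default tier.
const TIER_BY_TASK_TYPE = new Map([
    ['bot_reply', 'small'],
    ['lint_fix', 'small'],
    ['code_review', 'medium'],
    ['scoped_refactor', 'medium'],
    ['migration', 'large'],
    ['audit', 'large'],
]);

// Returns the envelope for a task type, or throws if none was assigned.
function envelopeFor(taskType) {
    const tier = TIER_BY_TASK_TYPE.get(taskType);
    if (tier === undefined) {
        throw new Error(`no tier assigned for task type: ${taskType}`);
    }
    return { tier, ...ENVELOPE_TIERS[tier] };
}
```

Throwing on an unknown task type, rather than falling back to a default tier, keeps new task types from silently inheriting a large envelope.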

2. Track running spend in a per-task ledger

Don't trust the agent to track its own spend. Track it in your harness, in storage you control. The minimum schema:

-- one row per task, updated on every API call
CREATE TABLE agent_task_ledger (
    task_id      uuid PRIMARY KEY,
    tier         text NOT NULL,        -- 'small' | 'medium' | 'large'
    token_budget bigint NOT NULL,
    tokens_used  bigint NOT NULL DEFAULT 0,
    wall_budget  interval NOT NULL,
    started_at   timestamptz NOT NULL,
    state        text NOT NULL,        -- 'running' | 'killed' | 'done' | 'escalated'
    -- attribution
    team_slug    text NOT NULL,
    project_slug text NOT NULL,
    user_id      uuid,
    -- metadata
    cause_kind   text,                 -- 'pr_review' | 'cron' | 'webhook' | ...
    cause_id     text                  -- the PR number, cron name, webhook id
);
CREATE INDEX ON agent_task_ledger (team_slug, started_at DESC);

Every API call goes through middleware that updates tokens_used in the same transaction as the call's bookkeeping. No middleware bypass; no “just for this one call.”
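To make the ledger semantics concrete, here is an in-memory stand-in for the table. It is a sketch for illustration and tests only: the class and method names are assumptions, and a real harness would do the increment as a single UPDATE against agent_task_ledger inside the bookkeeping transaction.

```javascript
// In-memory stand-in for agent_task_ledger. Illustration only.
class InMemoryLedger {
    constructor() {
        this.rows = new Map();
    }

    // Opens one row per task. Attribution is required up front.
    open(taskId, { tier, tokenBudget, teamSlug, projectSlug }) {
        if (this.rows.has(taskId)) {
            throw new Error(`task ${taskId} already has a ledger row`);
        }
        if (!teamSlug || !projectSlug) {
            throw new Error('team_slug and project_slug are required');
        }
        this.rows.set(taskId, {
            tier, tokenBudget, teamSlug, projectSlug,
            tokensUsed: 0,
            startedAt: Date.now(),
            state: 'running',
        });
    }

    // Adds the tokens from one API call to the running total and
    // reports how much of the budget is left (negative once over).
    charge(taskId, tokens) {
        const row = this.rows.get(taskId);
        if (!row) throw new Error(`unknown task ${taskId}`);
        if (!Number.isInteger(tokens) || tokens < 0) {
            throw new Error('tokens must be a non-negative integer');
        }
        row.tokensUsed += tokens;
        return { tokensUsed: row.tokensUsed, remaining: row.tokenBudget - row.tokensUsed };
    }
}
```

Usage: open a row when the task starts, then call charge with the input plus output token count after every API call, retries included.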

3. Hard-stop when the envelope is exhausted

The middleware checks the envelope before dispatching the call and after the call returns. If either check fails, the task transitions to killed and a BudgetExceededError bubbles up to the harness. The agent doesn't get another turn.

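// check the envelope before each call, charge it after, then check again
async function dispatch(task, request) {
    // if loadLedger throws, nothing is dispatched (fail closed)
    const ledger = await loadLedger(task.id);
    if (ledger.state !== 'running') {
        throw new BudgetExceededError(`task already ${ledger.state}`);
    }
    await enforceEnvelope(task.id, ledger.tokens_used, ledger);
    const response = await anthropic.messages.create(request);
    const spent = response.usage.input_tokens + response.usage.output_tokens;
    await chargeTokens(task.id, spent);
    // after the call: kill now if this response pushed the task over a limit
    await enforceEnvelope(task.id, ledger.tokens_used + spent, ledger);
    return response;
}

// marks the task killed and throws if either limit has been reached
async function enforceEnvelope(taskId, tokensUsed, ledger) {
    if (tokensUsed >= ledger.token_budget) {
        await markKilled(taskId, 'token_cap');
        throw new BudgetExceededError('token cap hit');
    }
    // wall_budget_ms: the wall_budget interval in milliseconds, as returned by loadLedger
    if (Date.now() - ledger.started_at > ledger.wall_budget_ms) {
        await markKilled(taskId, 'wall_clock');
        throw new BudgetExceededError('wall clock hit');
    }
}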

Fail closed. If the ledger is unreachable, the task does not get to call out: you'd rather lose a task than guess at budget state. The retry path is for the harness, not the agent.
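A minimal sketch of the fail-closed rule, with the ledger loader and the model call passed in as plain functions so it can be exercised without a database or an API key. The names guardedCall and LedgerUnavailableError are hypothetical.

```javascript
class LedgerUnavailableError extends Error {}

// Runs callModel only if the ledger could be read and the task is running.
// Any failure to load the ledger (timeout, connection error, missing row)
// blocks the outbound call instead of letting it through.
async function guardedCall(loadLedger, callModel, taskId, request) {
    let ledger;
    try {
        ledger = await loadLedger(taskId);
    } catch (err) {
        throw new LedgerUnavailableError(`ledger unreachable for ${taskId}: ${err.message}`);
    }
    if (!ledger) {
        throw new LedgerUnavailableError(`no ledger row for ${taskId}`);
    }
    if (ledger.state !== 'running') {
        throw new Error(`task ${taskId} is ${ledger.state}, not running`);
    }
    return callModel(request);
}
```

The important property is the ordering: the model call sits after every ledger check, so there is no code path that reaches the API with unknown budget state.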

4. Attribution tags on every call

Every API call carries metadata that maps it back to a task, team, project, and human. This is non-optional — it's how you answer the “whose budget is this agent burning” question without parsing logs after the bill arrives.

// every API call attaches metadata
const response = await anthropic.messages.create(
    {
        model: 'claude-sonnet-4-7',
        max_tokens: 4096,
        messages,
        metadata: {
            user_id: `task:${task.id}`,        // shows in Anthropic console
        },
    },
    {
        // custom headers go in the SDK's per-request options, not in the request body
        headers: {
            'X-PB-Task':    task.id,
            'X-PB-Team':    task.team_slug,
            'X-PB-Project': task.project_slug,
            'X-PB-Tier':    task.tier,
            'X-PB-Cause':   `${task.cause_kind}:${task.cause_id}`
        },
    },
);

The headers are for your own observability stack (you log them on every request, see Module 19). The metadata.user_id field is the one Anthropic's console actually shows — using task:<uuid> as the value lets you grep the console for a specific task without leaking real user IDs into Anthropic.
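Since the tags are non-optional, it helps to build them in one place and refuse to proceed when any are missing. A minimal sketch, assuming the task object carries the same fields as the ledger row; the function name and the REQUIRED_TAGS list are hypothetical.

```javascript
// Fields every task must carry before a call is allowed out.
const REQUIRED_TAGS = ['id', 'team_slug', 'project_slug', 'tier', 'cause_kind', 'cause_id'];

// Builds the X-PB-* headers for a task, or throws if any tag is missing.
function attributionHeaders(task) {
    const missing = REQUIRED_TAGS.filter(
        (key) => task[key] === undefined || task[key] === null || task[key] === '',
    );
    if (missing.length > 0) {
        throw new Error(`call rejected, missing attribution: ${missing.join(', ')}`);
    }
    return {
        'X-PB-Task':    String(task.id),
        'X-PB-Team':    task.team_slug,
        'X-PB-Project': task.project_slug,
        'X-PB-Tier':    task.tier,
        'X-PB-Cause':   `${task.cause_kind}:${task.cause_id}`,
    };
}
```

Calling this from the middleware, before the request is built, turns a missing tag into an immediate error rather than an unattributed line on the bill.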

5. Escalation: the agent can ask, you decide

Some tasks legitimately need more. A migration the harness scoped as “Medium” turns out to need well over 500k tokens because the codebase grew. You don't want to hard-fail in that case, but you don't want the agent to silently extend itself either.

The escalation path: when the agent is within 10% of an envelope, it emits a structured request (a tool call to your harness, not a chat message to a human). The harness pages an on-call human or a team channel with one click to approve, deny, or upgrade tier. Approved escalations are logged with the approver's identity — the audit trail matters when the same agent escalates three times in a week.
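The 10% trigger and the structured request might look like the sketch below. The field names on the ledger object (including started_at_ms and wall_budget_ms as plain millisecond numbers) and the shape of the request payload are assumptions; only the 10% threshold and the approve/deny/upgrade options come from the text above.

```javascript
// Returns a structured escalation request when the task is within
// `threshold` (default 10%) of either envelope limit, otherwise null.
function escalationRequest(ledger, nowMs = Date.now(), threshold = 0.10) {
    const tokensLeft = ledger.token_budget - ledger.tokens_used;
    const msLeft = ledger.wall_budget_ms - (nowMs - ledger.started_at_ms);
    const nearTokenCap = tokensLeft <= ledger.token_budget * threshold;
    const nearWallClock = msLeft <= ledger.wall_budget_ms * threshold;
    if (!nearTokenCap && !nearWallClock) return null;
    return {
        kind: 'envelope_escalation',
        task_id: ledger.task_id,
        current_tier: ledger.tier,
        reason: nearTokenCap ? 'token_cap' : 'wall_clock',
        tokens_used: ledger.tokens_used,
        token_budget: ledger.token_budget,
        // the human picks one; the agent cannot approve its own request
        options: ['approve', 'deny', 'upgrade_tier'],
    };
}
```

The function only produces the request. Delivering it to a channel and recording the approver's identity stay in the harness, outside anything the agent can call directly.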

// the math on doing this for real

For a fleet of ~40 agent tasks/day across 6 teams, a typical week looks like:

~280 task envelopes claimed (mostly Small, some Medium)
3–6 envelope hits per week (mostly token cap, occasional wall clock)
1–2 escalation requests per week, all legitimate
~$120/wk total spend, predictable to within 15%

Without envelopes, spend ranged from $40/wk to $11,400/wk. With envelopes, the worst week we've had was $180. The pattern paid for itself in 36 hours.

Five Anti-Patterns

Each of these has cost a real team real money. Named, so they're easy to call out in code review.

1. The Forever Retry Loop

The agent retries a failing tool call indefinitely. The retry logic doesn't count against the task envelope, or the envelope doesn't exist yet. One bad API gateway day, one transient 502 from a dependency, and the agent burns 4,000 calls before someone notices.

Fix: Cap retries per task to 3. Use exponential backoff. Count retries against the same envelope as everything else — retries are not free, and the budget should reflect that.
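One way to express that fix is a retry budget that belongs to the task rather than to each call. This is a sketch under assumptions: RetryBudget, withRetries, and the recordAttempt hook are hypothetical names, and recordAttempt is expected to charge the attempt to the task ledger and throw once the envelope is exhausted.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One instance per task, shared by every call the task makes.
class RetryBudget {
    constructor(maxRetries = 3) {
        this.remaining = maxRetries;
    }
    // Consumes one retry. Returns false when none are left.
    take() {
        if (this.remaining <= 0) return false;
        this.remaining -= 1;
        return true;
    }
}

// Runs `attempt`, retrying on failure with exponential backoff until the
// task's retry budget is spent. Every attempt, including the first, is
// recorded against the envelope before it runs.
async function withRetries(attempt, recordAttempt, retryBudget, baseDelayMs = 200) {
    for (let n = 0; ; n++) {
        await recordAttempt();
        try {
            return await attempt();
        } catch (err) {
            if (!retryBudget.take()) throw err;
            await sleep(baseDelayMs * 2 ** n); // 200ms, 400ms, 800ms by default
        }
    }
}
```

Because the budget is per task, a tool that fails on every call gets three retries in total, not three per call.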

2. The Thinking-Mode-Everywhere

Someone shipped a feature flag that turns on extended thinking for every call “just in case it helps.” Cost goes up 3–5x for unclear gains on most tasks (the agent is doing PR triage, not solving the Riemann hypothesis). The improvement is real on hard tasks, but you're paying for it on everything.

Fix: Thinking mode is opt-in per task type, gated by a per-call thinking budget that's set separately from the task envelope; the thinking tokens themselves still count against the envelope. Default off. Document which tasks justify it and require a code-comment justification.
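A minimal sketch of the opt-in gate. It assumes the Messages API accepts a thinking parameter of the form { type: 'enabled', budget_tokens } with a budget below max_tokens; check the current API reference before relying on that. The allow-list contents and budget numbers are purely illustrative.

```javascript
// Hypothetical allow-list: task types that have justified extended
// thinking, each with its own per-call thinking budget in tokens.
const THINKING_BUDGETS = new Map([
    ['migration', 8000], // justification would be documented next to this entry
    ['audit', 4000],
]);

// Returns the request unchanged for task types not on the allow-list
// (default off), or a copy with thinking enabled for those that are.
function withThinkingPolicy(taskType, request) {
    const budget = THINKING_BUDGETS.get(taskType);
    if (budget === undefined) return request;
    if (budget >= request.max_tokens) {
        throw new Error('thinking budget must be below max_tokens');
    }
    return { ...request, thinking: { type: 'enabled', budget_tokens: budget } };
}
```

Keeping the allow-list in code means that turning thinking on for a new task type shows up as a diff someone has to review.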

3. The Anthropic-Dashboard-As-Monitor

Your only cost visibility is the Anthropic console. It updates slowly enough that an agent can burn the daily budget before the chart catches up, and it's aggregated across the entire account, so you can't tell which task ate it.

Fix: Emit cost events to your own observability layer (see Module 19). Tag every call with task, team, project. Build the “top 10 burners last hour” query yourself. Anthropic's console is a backstop, not the primary monitor.
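The “top 10 burners last hour” query is small enough to sketch over an in-memory list of cost events. The event shape (task_id, team_slug, cost_usd, at_ms) is an assumption; in practice the same aggregation would run in whatever store your observability layer uses.

```javascript
// events: [{ task_id, team_slug, cost_usd, at_ms }]
// Returns the highest-spending tasks within the window ending at nowMs.
function topBurners(events, nowMs, { windowMs = 60 * 60 * 1000, limit = 10 } = {}) {
    const totals = new Map();
    for (const event of events) {
        // skip events outside the window
        if (event.at_ms < nowMs - windowMs || event.at_ms > nowMs) continue;
        const row = totals.get(event.task_id)
            ?? { task_id: event.task_id, team_slug: event.team_slug, cost_usd: 0 };
        row.cost_usd += event.cost_usd;
        totals.set(event.task_id, row);
    }
    return [...totals.values()]
        .sort((a, b) => b.cost_usd - a.cost_usd)
        .slice(0, limit);
}
```

Grouping by task first and carrying the team along makes the per-team rollup a second pass over the same output.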

4. The Per-API-Key Budget

Your only budget unit is the Anthropic API key — team A and team B share a key, so all you see is “the key burned $12k.” You catch the problem only after the org-level limit hits, by which point you're well into next month's budget.

Fix: Per-task envelopes that fail closed. Per-team rollups for reporting. The API key is for authenticating; the budget unit is the task.

5. The After-the-Fact Attribution

Some agent's burned $4k this week and you're trying to figure out whose by parsing application logs after the bill arrives. Half the logs are missing the right structured fields. The other half are missing because the run crashed mid-task. You end up assigning cost based on guesses.

Fix: Tag every call upfront with task, team, project, cause. Persist the tag in the same transaction as the API call's bookkeeping — not in a fire-and-forget log line that might never reach storage. Attribution is a schema constraint, not a logging convention.

The Decision Tree: When NOT to Apply This

Cost envelopes are non-negotiable for autonomous, multi-call agents. They're overkill or counter-productive in three situations:

One-off interactive use — a developer running claude CLI by hand. The human is the budget; envelopes here just add friction. (The org-level Anthropic budget still catches genuine misuse.)
Trusted single-shot calls — a chat-completion endpoint that takes a prompt, returns one response, has no tools, no retries, no loop. The cost is bounded by the prompt and max_tokens; an envelope adds bookkeeping with no upside.
Local-only sandboxes — running against a local model where the “budget” is your laptop's GPU time. Different problem, different solution.

Everywhere else — anything autonomous, anything with a tool loop, anything that runs on a cron, anything that gets triggered by a webhook — envelopes are how you sleep at night.

War Story: The PR-Review Agent That Read One File 4,200 Times

// what an unbounded loop actually costs

True story, anonymized. We shipped an agent that reviewed PRs against a style guide. It used a tool that fetched files from the repo. The tool had a quiet bug: under certain race conditions with our caching layer, it would return a 304 with no body. The agent's interpretation of an empty file was “I haven't read this file yet” — so it asked again. Cache was still warm; 304 again. Loop.

The PR was a small one — 8 files. The agent stopped reviewing the diff and just kept asking for one of the files in particular. It made 4,200 tool calls over 36 hours before someone in the engineering channel mentioned that the bot was taking a really long time on that PR.

Direct cost: $11,400. Time cost: two engineers, half a day each, debugging. Trust cost: the team paused all autonomous-agent rollouts for a sprint.

Cost envelopes wouldn't have caught the bug, but they would have killed the run at the Medium tier's 500k-token or 10-minute cap, whichever came first. We'd have spent one envelope's worth of tokens instead of $11,400, and the bug would've been a Jira ticket instead of an incident review.

The thing that makes this story unremarkable is that everyone who runs autonomous agents has a version of it. It's not a sign of bad code or bad agents — it's a sign that the failure surface for autonomous tool-using systems is wider than “is the prompt good.” Envelopes are how you contain the blast radius of bugs you haven't found yet.

What to Build First

If you're starting from zero, the order that actually ships:

Ledger table + middleware (1–2 days). Just the schema, just the per-task token counter, just the hard-stop on cap hit. No tiers yet, no escalation, just one fixed envelope.
Tier definitions (half a day). Pick three tiers, assign each existing task to one. Resist the urge to add a fourth tier; resist harder the urge to skip tiers and treat each task as bespoke.
Attribution headers (half a day). Tag every call. Make the headers required in the middleware — missing tag = call rejected.
Wall-clock cap (half a day). Add the timestamp check. This catches the loops that sip slowly enough that they evade the token cap (yes, this happens).
Escalation path (1 day). Slack webhook + approve/deny/upgrade buttons. Don't bother with a full UI yet; the buttons in chat are enough.
Dashboard (1 day). Top 10 burners last hour. Top 5 teams this week. P99 task spend. Everything else can wait.

That's a week. After that you're spending less time on cost than you were before, and you can stop watching the Anthropic console. The pattern is doing the watching.

// for customer-facing AI features where cost varies per user, see 1.3 Unit Economics for BOT mode.