Building tamer's spec-completeness v2 — a 5-axis study

Five axes, five iterations each. How we turned 'is this spec good enough?' from a vibe into a checklist that a Master Agent can refuse a Work Item against.

2026-05-13 · tamer team · 6 min read

TL;DR

v1 had two axes: scope and acceptance criteria. It caught vague WIs about half the time.
v2 has five. The three new ones — invariants, dependencies, non-goals — were the axes that explained the v1 escapes we examined: WIs that shipped clean and were later flagged by reviewers as locally correct, globally drifting (roughly 20–40 WIs across the v1 → v2 forensic pass).
Each axis went through five iterations: heuristic → tighten → counterexample → relax → final form. The relax step matters: a gate that refuses everything is the same as no gate.
The gate ships today as a lint at WI-assign time. The companion metric — falsifiable ACs per project, per week — is the next post.

Why measure the spec at all

"Spec is the bottleneck, not the model" argued that, past a capability threshold, code quality is bounded by how completely the work is described. "Anatomy of a [DONE] event that never arrived" added that when an agent fails, the cause is rarely where the symptom is. Both end with the same plea: measure the spec before you measure the code. This is the first time we have published how.

v1 of the gate was six lines: scope present, ≥1 AC, ACs contain a verb. The real gate was the reviewer queue — a person, expensive, slow. We wanted the gate to be code.

Five is not a magic number. It's what survived five iterations each without collapsing back into vibes. Anything we tried as a sixth (priority, owner, deadline) was project-management metadata, not spec-completeness, and we kept it separate.

Axis 1 — Scope

Question: which paths is the worker allowed to write, read but not write, or not touch at all?

v1 → v2: a single scope.paths list became three (write, read_only, forbidden) plus an explicit extend_scope: false. A worker that thinks it needs more replies [WAIT] Scope extension request and waits. No silent expansion.

Bad: scope: [src/] — whole tree to flail in; 50-file diffs for 3-line changes. Good: write: [src/auth/login.ts, tests/auth/login.test.ts] · read_only: [src/auth/, src/middleware/] · forbidden: [src/billing/]. Any touched path outside write is a visible violation.
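The three-list scope check can be sketched in a few lines. This is a minimal illustration, not tamer's actual implementation: the `Scope` class, the `scope_violations` function, and the rule that an entry ending in `/` covers everything beneath it are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    # Field names mirror the WI fields described above; the class itself is hypothetical.
    write: list[str]
    read_only: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)
    extend_scope: bool = False


def _covers(entry: str, path: str) -> bool:
    """Assumed matching rule: 'dir/' covers everything under it, otherwise exact match."""
    if entry.endswith("/"):
        return path.startswith(entry)
    return path == entry


def scope_violations(scope: Scope, touched: list[str]) -> list[str]:
    """Return one message per touched path that is forbidden or outside `write`."""
    violations = []
    for path in touched:
        if any(_covers(e, path) for e in scope.forbidden):
            violations.append(f"forbidden: {path}")
        elif not any(_covers(e, path) for e in scope.write):
            violations.append(f"outside write: {path}")
    return violations


scope = Scope(
    write=["src/auth/login.ts", "tests/auth/login.test.ts"],
    read_only=["src/auth/", "src/middleware/"],
    forbidden=["src/billing/"],
)
print(scope_violations(scope, ["src/auth/login.ts", "src/auth/session.ts"]))
```

Checking `forbidden` before `write` means a path listed in both is still reported, which errs on the side of refusing.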

Cheapest axis, kills the most surface bloat (mechanism 3 in "why your AI agent is silently degrading your codebase"). Refuses 30% of submitted WIs; we have never regretted one.

Axis 2 — Acceptance criteria, falsifiable form

Question: can a third party, reading only the AC and the diff, decide whether the WI is done?

v1 → v2: free-text ACs became "≥1 observable predicate per WI." The gate greps for returns, emits, writes, logs, responds with, fails with, commits, tests pass, =, →. ACs matching none get bounced.

Bad: AC-1: auth works. "Works" is a vibe. Good: AC-1: POST /login with wrong password returns 401 and increments auth.rate_limit.{ip}; failing test in tests/auth/login.test.ts goes red→green. A reviewer decides in 30 seconds.

This took the most iterations. Iteration 4 was a strict regex that refused half the WIs we wanted to ship. Iteration 5 relaxed it to "one falsifiable predicate per WI, not per AC" — enough signal, low enough false-positive rate to live with.
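A rough sketch of the relaxed iteration-5 rule (one observable predicate per WI, not per AC) might look like the following. The marker list is taken from the text above; the function names and the word-boundary matching are assumptions, and the real gate may match differently.

```python
import re

# Markers listed in the post; how the real gate matches them is not specified.
MARKERS = [
    "returns", "emits", "writes", "logs", "responds with",
    "fails with", "commits", "tests pass", "=", "→",
]

_WORDS = [m for m in MARKERS if m[0].isalpha()]
_SYMBOLS = [m for m in MARKERS if not m[0].isalpha()]

# Word markers must match on word boundaries ("logs" should not match "blogs");
# symbol markers match anywhere in the text.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in _WORDS) + r")\b"
    + "|" + "|".join(re.escape(m) for m in _SYMBOLS)
)


def has_observable_predicate(ac: str) -> bool:
    """True if a single AC contains at least one marker."""
    return bool(_PATTERN.search(ac.lower()))


def wi_passes_verifiability(acs: list[str]) -> bool:
    """Iteration-5 rule: one matching AC anywhere in the WI is enough."""
    return any(has_observable_predicate(ac) for ac in acs)


print(wi_passes_verifiability(["AC-1: auth works"]))
print(wi_passes_verifiability(
    ["AC-1: POST /login with wrong password returns 401"]
))
```

A check this shallow accepts any AC that merely contains a marker, so it filters out pure vibes without proving an AC is meaningful.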

Axis 3 — Invariants

Question: which global properties must remain true after the change, even though the change isn't about them?

v1 → v2: didn't exist. v2 adds an optional invariants: [...] block. Empty is allowed (most small WIs don't touch invariants), but the gate logs WIs that touch >5 files with an empty invariants block and flags them for the next review.

Bad: a refactor WI with no invariants block, touching 12 files. Good: invariants: [{ "tenant_id is set on every outbound request": "data leaks across tenants if violated" }, { "no synchronous DB call inside the websocket handler": "p99 latency regresses" }]. The reviewer scans the diff for these two properties before reading anything else.
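The soft flag on this axis is simple enough to sketch directly. The function name and signature below are hypothetical; only the ">5 files with an empty invariants block" rule comes from the text.

```python
def needs_invariant_review(
    invariants: list[dict[str, str]],
    files_touched: int,
    threshold: int = 5,
) -> bool:
    """Soft flag: an empty invariants block on a WI touching more than `threshold` files.

    This never refuses the WI; it only marks it for attention at the next review.
    """
    return not invariants and files_touched > threshold


refactor = {"invariants": [], "files_touched": 12}
print(needs_invariant_review(refactor["invariants"], refactor["files_touched"]))
```

Each invariant here is a one-entry mapping from the property to the consequence of violating it, following the shape of the good example above.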

This axis addresses "locally correct, globally drifting" directly. It's the one we get most wrong: invariants live in senior heads, and pulling them out is hard. We treat the empty-block flag as a research signal, not a failure.

Axis 4 — Dependencies and preconditions

Question: what must already be true before the worker starts?

v1 → v2: depends_on: [] existed but was unread. v2 reads it: if depends_on references a WI that is not done, the worker refuses to start. If a WI declares no dependencies but mentions a component covered by another in-progress WI, the gate raises a soft warning.

Bad: a "wire up the new auth middleware to /admin" WI with no dependency on the prior "implement auth middleware" WI. The worker started, found no middleware, hallucinated one inline, broke /admin for everyone. Good: depends_on: [WI-AUTH-IMPL] — the worker doesn't start until WI-AUTH-IMPL is done.
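The two dependency rules (a hard refusal on unfinished depends_on entries, a soft warning on undeclared overlap) could be sketched as below. The `WorkItem` shape, the status strings, the `components` field, and the ID WI-ADMIN-WIRE are all invented for illustration; only depends_on and WI-AUTH-IMPL appear in the text.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    id: str
    status: str  # assumed values: "todo", "in_progress", "done"
    depends_on: list[str] = field(default_factory=list)
    components: set[str] = field(default_factory=set)  # hypothetical field


def check_dependencies(
    wi: WorkItem, board: dict[str, WorkItem]
) -> tuple[list[str], list[str]]:
    """Return (blockers, warnings). Any blocker means the worker must not start."""
    blockers = [
        dep for dep in wi.depends_on
        if dep not in board or board[dep].status != "done"
    ]
    warnings = []
    if not wi.depends_on:
        # Soft rule: no declared dependencies, but overlap with an in-progress WI.
        for other in board.values():
            if other.id == wi.id or other.status != "in_progress":
                continue
            shared = wi.components & other.components
            if shared:
                warnings.append(f"{other.id} is in progress on {sorted(shared)}")
    return blockers, warnings


board = {
    "WI-AUTH-IMPL": WorkItem(
        "WI-AUTH-IMPL", "in_progress", components={"auth-middleware"}
    )
}
wire_up = WorkItem(
    "WI-ADMIN-WIRE", "todo",
    depends_on=["WI-AUTH-IMPL"], components={"auth-middleware"},
)
print(check_dependencies(wire_up, board))
```

Treating a dependency that is missing from the board as a blocker is a deliberate choice in this sketch: an unknown WI cannot be known to be done.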

Structural insurance against the [DONE]-event-that-never-arrived class of bug: an agent that can't begin work it isn't ready for cannot get stuck mid-flight.

Axis 5 — Non-goals and failure modes

Question: what is the worker explicitly not trying to do?

v1 → v2: didn't exist. v2 adds non_goals: [...] and accepted_failures: [...]. Both optional, strongly encouraged for any WI larger than 50 lines of expected diff.

Bad: "add caching to the user-profile endpoint." The worker reasonably reads in cache invalidation, cache warming, a metrics hook, and a feature flag too. Diff: 600 lines. Good: same WI plus non_goals: [cache invalidation, metrics, feature flag] · accepted_failures: [stale data up to 60s is acceptable]. Diff: 40 lines. Everything left out is its own WI later, on purpose.
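Because both fields are optional, a check on this axis is more of a nudge than a gate. The sketch below assumes the WI carries an estimate of its expected diff size (`expected_diff_lines`, a hypothetical field; the text does not say how that estimate is produced) and reports which encouraged fields are missing above the 50-line threshold.

```python
def missing_encouraged_fields(
    expected_diff_lines: int,
    non_goals: list[str],
    accepted_failures: list[str],
    threshold: int = 50,
) -> list[str]:
    """List the encouraged-but-empty fields for a WI above the size threshold."""
    if expected_diff_lines <= threshold:
        return []
    missing = []
    if not non_goals:
        missing.append("non_goals")
    if not accepted_failures:
        missing.append("accepted_failures")
    return missing


# The caching WI from the example above, before and after adding the fields.
print(missing_encouraged_fields(600, [], []))
print(missing_encouraged_fields(
    600,
    ["cache invalidation", "metrics", "feature flag"],
    ["stale data up to 60s is acceptable"],
))
```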

Non-goals fight reward hacking (mechanism 1 in article 2) head-on. An agent that knows what it is not supposed to do has fewer cheap green-tick shortcuts.

What this is and what it isn't

Five axes, five iterations each, ~1 month of internal use. The gate refuses ~40% of incoming WIs. Refusal rate by axis: scope 30%, verifiability 45%, invariants 10% strict + 30% soft, dependencies 5%, non-goals 15% (sum > 100% because the worst WIs fail on multiple).

Two known gaps. The invariant axis still needs senior judgement to seed — no way yet to mine invariants from existing CI rules. The verifiability regex is shallow; an AC that says "tests pass" is technically compliant and semantically empty. Iteration 6 will probably attack that with a real verb-plus-object check.

If you run AI coding agents and the reviewer queue is the only thing keeping your codebase sane, tamer is free to self-host: BSL 1.1, no per-seat fee, no telemetry. The gate ships as part of the Master role. The FAQ covers WI shape, refusal codes, and how to extend the axes for your stack.

We ship one of these every two days. Subscribe to the feed if you want them as they land.

Footnote. Same source-caveat pattern as prior pieces. The five-axis structure is a tamer design choice, not a borrowed framework — not derived from a published study, not claimed to be exhaustive or optimal outside our own codebases. SWE-CI (referenced in article 1 and article 2) is the prompt for caring about spec completeness at all; we have not independently verified its 75% headline and treat it as order-of-magnitude. The refusal percentages above come from our internal log over roughly a month of v2 use across a small set of projects — directionally honest, not statistically powered. If your breakdown looks different, open a GitHub issue — we want to recalibrate against external data.