Spec is the bottleneck, not the model

Once model capability passes a threshold, code quality depends on the completeness of the specification — not the intelligence of the model. A field note from tamer's design log.

2026-05-05 · tamer team · 10 min read

Original article: "Your coding agent is under-specified" by Hamidreza Saghir (2026-05-02). This post is our response and a description of how tamer is structurally aligned with the thesis.

TL;DR

Once model capability passes a threshold, the limiting factor is spec completeness, not model intelligence.
Better models write code that is more confidently wrong — locally tidy, globally drifting.
Tests catch behaviour, not architectural drift across modules.
Tamer is structurally aligned with this thesis. We treat coding agents as bounded executors of explicit specs, not oracles.
We are not done. Five known gaps below.

The thesis

A recent benchmark from Alibaba — SWE-CI — ran 18 AI agents against 100 real codebases over 233 days and 71 consecutive commits. Over time, more than 75% of agents showed an increasing regression rate. Their changes broke previously-passing tests at an accelerating pace.
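To make "regression rate" concrete, here is a minimal sketch of one way to measure it on your own commit history. The definition used here (the share of tests that passed at the previous commit and no longer pass at the current one) is our assumption for illustration and may not match the exact metric SWE-CI reports. The function names and the sample history are hypothetical.

```python
from typing import Dict, List

# One test run per commit: test name -> True if it passed.
TestRun = Dict[str, bool]


def regression_rate(previous: TestRun, current: TestRun) -> float:
    """Share of previously-passing tests that no longer pass.

    A test that disappeared from `current` is counted as broken.
    """
    passing_before = [name for name, ok in previous.items() if ok]
    if not passing_before:
        return 0.0
    broken = [name for name in passing_before if not current.get(name, False)]
    return len(broken) / len(passing_before)


def rates_over_history(runs: List[TestRun]) -> List[float]:
    """Regression rate for each consecutive pair of commits."""
    return [regression_rate(a, b) for a, b in zip(runs, runs[1:])]


def is_trending_up(rates: List[float]) -> bool:
    """Crude trend check: the later half averages higher than the earlier half."""
    if len(rates) < 2:
        return False
    mid = len(rates) // 2
    early, late = rates[:mid], rates[mid:]
    return sum(late) / len(late) > sum(early) / len(early)


# Invented history of four commits and four tests.
history = [
    {"t1": True, "t2": True, "t3": True, "t4": True},
    {"t1": True, "t2": True, "t3": True, "t4": True},
    {"t1": True, "t2": True, "t3": True, "t4": False},
    {"t1": False, "t2": True, "t3": False, "t4": False},
]
rates = rates_over_history(history)
print(rates, is_trending_up(rates))
```

In this invented history the rate goes from 0 to 1/4 to 2/3, which is the accelerating shape the benchmark describes. A real measurement would need a less crude trend test and a decision about flaky tests.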

The simplest interpretation is "models are bad at long-horizon work." The more uncomfortable interpretation is the one that survives scrutiny:

Once model capability exceeds a certain threshold, code quality depends on the completeness of the specification, not the intelligence of the model.

The model is no longer the bottleneck. The bottleneck moved upstream — into how we describe the work.

Four structural reasons it gets worse over time

1. Code is precise, prompts are vague

"Add authentication" is two words. The corresponding code can be hundreds of lines, encoding decisions about session storage, token rotation, password policy, MFA fallback, account-recovery flow, and rate limiting. Each of those decisions has a wrong answer that ships and a right answer the prompt did not name.

When a prompt is ambiguous, an LLM does not stop and ask. It samples. It fills the gap from training priors — which means it averages the answer over how thousands of other engineers built auth in 2022. That average is not your auth.

A prompt is a compressed program. Code is the decompression. Compression is lossy. The lossiness is silent.

2. Humans don't prompt with code-precision

People write requirements the way they speak. Natural language is permissive: subjects can be implied, adjectives can be omitted, exceptions can be swept under "obviously". Code is not permissive. Every comma matters.

The agent receives the permissive form and emits the strict form. If the strict form is wrong, you find out later — at runtime, in production, when an edge case the prompt didn't name fires for the first time.

The model never says "this instruction is insufficient." It cannot, because to say that it would need a model of what would be sufficient, which is the very thing you didn't tell it. So codebases accumulate as fossil records of guesses.

3. Scale beats fidelity — locally correct, globally incoherent

Suppose you do write a complete spec for the change at hand. You name the function signature, the error cases, the invariant to preserve. The model satisfies all of it. The PR is green. You merge.

Three modules away, an invariant you did not mention has just been violated. The change was correct in its window of attention and silently wrong outside it. Tests on the touched module pass. Tests in the distant module that depended on the invariant pass too — because their assertions never included the invariant explicitly. The bug is not in the code; it is in the shape of the test suite.

This is the failure mode that scales worst. It does not look like a bug. It looks like the system slowly losing internal consistency over many commits. By the time anyone notices, no single commit is at fault, and reverting any one of them does not restore coherence.
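A toy reproduction of this failure mode, as a sketch only. Every module, function, and value below is invented for illustration. The unwritten invariant is "user ids come back sorted ascending"; a distant consumer relies on it through binary search, and neither test suite asserts it.

```python
import bisect
from typing import Callable, List

# Hypothetical "users" module. Unwritten invariant: ids are sorted ascending.
_USER_IDS: List[int] = [3, 17, 42]


def list_user_ids() -> List[int]:
    return sorted(_USER_IDS)


def list_user_ids_after_change() -> List[int]:
    # The changed version returns newest first. It satisfies its own spec
    # and its own tests, but drops the sorted-ascending invariant.
    return sorted(_USER_IDS, reverse=True)


# Hypothetical "billing" module, three modules away. Binary search is only
# correct on ascending input.
def has_user(user_ids: List[int], user_id: int) -> bool:
    i = bisect.bisect_left(user_ids, user_id)
    return i < len(user_ids) and user_ids[i] == user_id


def users_suite_passes(list_fn: Callable[[], List[int]]) -> bool:
    # Checks content only, so both versions pass.
    return set(list_fn()) == {3, 17, 42}


def billing_suite_passes() -> bool:
    # Uses a hand-written fixture that happens to be sorted, so it never
    # exercises the real producer and never sees the violation.
    return has_user([1, 2, 3], 2) and not has_user([1, 2, 3], 9)


def sorted_invariant_holds(list_fn: Callable[[], List[int]]) -> bool:
    # The invariant made explicit. This is the assertion nobody wrote.
    ids = list_fn()
    return all(a <= b for a, b in zip(ids, ids[1:]))


print(users_suite_passes(list_user_ids_after_change))   # suite stays green
print(billing_suite_passes())                           # suite stays green
print(has_user(list_user_ids_after_change(), 3))        # wrong answer in production
print(sorted_invariant_holds(list_user_ids_after_change))
```

Both suites stay green after the change, yet the lookup for an existing user now fails. Only the explicit invariant check notices.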

4. Models are trained on artifacts, not the engineering process

This one is the deepest. Training corpora capture code. They capture commit messages, but they do not capture:

the design that got rejected at review,
the refactor that was scheduled six months later because of this design,
the incident on a Tuesday in 2024 caused by exactly this kind of shortcut,
the team conversation that ended in "not yet — wait until Q3."

Production engineering is an intertemporal optimisation problem. It is the discipline of refusing the locally-cheapest option because of consequences that show up in twelve months. The training corpus is a flatland: it sees the commit, not the year of context that shaped the commit.

A model trained on artifacts can produce artifacts that look like the corpus. It cannot reliably produce the choice not to ship the artifact. That choice — the negative space of engineering — is invisible in the data.

Tests don't save you

The intuitive defense is "we have tests." But tests verify behavioural correctness against assertions a human wrote. They do not verify architectural quality. They cannot, because architectural quality is about properties that emerge from the interaction of modules — properties that no individual test was written to assert.

Better models do not fix this. They write code that is more confidently wrong: well-named, type-checked, locally consistent, globally drifting. The test suite stays green. The codebase ages anyway.

How tamer is structurally aligned

Tamer is a Master Agent that orchestrates coding agents (Claude Code, Aider, Gemini CLI, Cursor, Cline) with a kernel-enforced sandbox and a remote human in the loop. We started building it before reading the SWE-CI paper. The design choices below were arrived at independently, but they map onto the four structural reasons above.

We do not claim these are sufficient. We claim they are the right shape of defenses.

1. Work Items have explicit, verifiable Acceptance Criteria

Every assignment is a Work Item (WI) with YAML frontmatter, a scope (the paths the worker is allowed to write), and a list of Acceptance Criteria. Each AC must be verifiable — not "auth works" but "POST /login with bad password returns 401 and increments the rate-limit counter." A worker that cannot point to the AC it satisfied has not finished.

This addresses reason #1 (code precision) by forcing the prompt to be precise before the agent runs.
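As a rough sketch of the shape of a Work Item, here is an in-memory model in Python. This is not tamer's actual schema: the field names, the id format, and the is_assignable check are assumptions made for illustration, and the YAML frontmatter is left out to keep the example free of third-party dependencies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AcceptanceCriterion:
    id: str
    statement: str  # one concrete, checkable condition


@dataclass(frozen=True)
class WorkItem:
    id: str
    title: str
    scope: List[str]  # path patterns the worker is allowed to write
    acceptance_criteria: List[AcceptanceCriterion] = field(default_factory=list)

    def is_assignable(self) -> bool:
        """Without a scope and at least one AC, the worker has nothing to be held to."""
        return bool(self.scope) and bool(self.acceptance_criteria)


# Hypothetical WI built around the AC quoted above.
login_wi = WorkItem(
    id="WI-0001",
    title="Reject bad passwords on login",
    scope=["src/auth/*", "tests/auth/*"],
    acceptance_criteria=[
        AcceptanceCriterion(
            id="AC-1",
            statement="POST /login with bad password returns 401 "
                      "and increments the rate-limit counter",
        ),
    ],
)
print(login_wi.is_assignable())
```

The point of the structure is that "done" becomes a statement about named ACs rather than a feeling about the task.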

2. The Master verifies — never accepts the worker's claim

When a worker replies [DONE], the Master does not believe it. The Master reads the diff via the same MCP tools the worker used and checks each AC against the actual code state. If the diff does not satisfy the AC, the WI is bounced back with [REJECTED] and the specific gap.

This addresses reason #2 (silent guesses) by removing the trust window where unverified work accumulates into the codebase.
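The verification step can be sketched as follows. This is a simplification under stated assumptions: the "code state" is modelled as a plain mapping from path to file contents rather than anything read through MCP tools, and the substring checks are stand-ins for real verification such as running a test. CheckableAC, verify, and the file path are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: the post-diff code state is just path -> file contents.
CodeState = Dict[str, str]


@dataclass
class CheckableAC:
    id: str
    statement: str
    check: Callable[[CodeState], bool]  # evidence comes from the code, not the worker


def verify(worker_status: str, acs: List[CheckableAC], state: CodeState) -> str:
    """Return '[DONE]' only if every AC holds in the actual code state."""
    if worker_status != "[DONE]":
        return worker_status
    gaps = [ac for ac in acs if not ac.check(state)]
    if gaps:
        details = "; ".join(f"{ac.id}: {ac.statement}" for ac in gaps)
        return f"[REJECTED] unmet: {details}"
    return "[DONE]"


LOGIN = "src/auth/login.py"  # hypothetical path
acs = [
    CheckableAC("AC-1", "bad password returns 401",
                lambda s: "401" in s.get(LOGIN, "")),
    CheckableAC("AC-2", "rate-limit counter is incremented",
                lambda s: "rate_limit.increment(" in s.get(LOGIN, "")),
]

# The worker claims [DONE], but the diff only implements the 401.
state = {LOGIN: "def login(req):\n    return Response(status=401)\n"}
verdict = verify("[DONE]", acs, state)
print(verdict)
```

The worker's claim is an input to the decision, never the decision itself. The rejection message names the specific AC that is unmet so the next attempt has something concrete to fix.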

3. REVIEW pipeline with cross-pollination

Non-trivial WIs are not closed at [DONE]. They enter REVIEW — assigned to a different worker. The reviewer sees the diff cold, without the implementer's mental model. Errors that were locally consistent for the implementer become visible to a fresh reader.

This addresses reason #3 (locally correct, globally incoherent) by adding a second perspective that does not share the first one's blind spots.
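The routing rule is small enough to sketch directly. The worker ids, the CLOSED state name, and the first-idle selection policy are assumptions for illustration; the only property taken from the text above is that the reviewer is never the implementer.

```python
from typing import List, Optional, Set


def next_state(worker_status: str, non_trivial: bool) -> str:
    """Non-trivial WIs go to REVIEW instead of closing at [DONE]."""
    if worker_status != "[DONE]":
        return worker_status
    return "REVIEW" if non_trivial else "CLOSED"


def pick_reviewer(implementer: str, workers: List[str],
                  busy: Optional[Set[str]] = None) -> Optional[str]:
    """Pick the first idle worker that is not the implementer, else None."""
    busy = busy or set()
    for worker in workers:
        if worker != implementer and worker not in busy:
            return worker
    return None


workers = ["claude-code", "aider", "gemini-cli"]  # hypothetical worker ids
print(next_state("[DONE]", non_trivial=True))
print(pick_reviewer("aider", workers))
```

Returning None when no other worker is available is deliberate in this sketch: a review by the implementer would defeat the purpose, so the WI should wait rather than self-review.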

4. Persistent memory accumulates non-derivable context

Tamer maintains a per-project memory of facts that are not in the code: who decided what and why, which approaches were rejected, what the team is currently constrained by. This memory survives across sessions and is loaded into every new conversation.

This addresses reason #4 (artifacts, not process) by making rejected designs and team conversations part of the agent's working context — not absent from it.
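A minimal sketch of the idea, assuming a single JSON file per project. The storage format, the ProjectMemory class, the fact kinds, and the example facts are all invented for illustration and say nothing about how tamer stores memory internally.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class ProjectMemory:
    """Append-only list of facts persisted as JSON (assumed layout)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> List[Dict[str, str]]:
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, kind: str, text: str) -> None:
        facts = self.load()
        facts.append({"kind": kind, "text": text})
        self.path.write_text(json.dumps(facts, indent=2), encoding="utf-8")

    def as_context(self) -> str:
        """Render every fact as a preamble for a new conversation."""
        return "\n".join(f"[{f['kind']}] {f['text']}" for f in self.load())


with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "memory.json"

    # Session 1 records facts that are not derivable from the code.
    ProjectMemory(memory_file).remember(
        "rejected-design", "Storing sessions in a signed cookie: rejected at review")
    ProjectMemory(memory_file).remember(
        "constraint", "No schema migrations until Q3")

    # Session 2 starts from a fresh object and still sees both facts.
    context = ProjectMemory(memory_file).as_context()

print(context)
```

What matters is not the storage but the discipline: the facts are written down at the moment they are decided, and every new session starts with them in view.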

5. Drift detection and skill-gap detection

Every commit a worker makes is checked: did it touch files outside scope? Did the worker claim ACs that the diff does not satisfy? Both are detected automatically. A worker that drifts gets its workspace reset; a worker that hallucinates progress gets its WI marked [BLOCKED].

This addresses all four reasons collectively, by converting silent failure modes into loud ones.
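Both checks reduce to set comparisons, sketched below. Glob-style scope patterns, the function names, and the RESET_WORKSPACE and OK labels are assumptions for illustration; only [BLOCKED] comes from the description above.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def out_of_scope(touched: Iterable[str], scope: List[str]) -> List[str]:
    """Return touched paths that match none of the scope patterns.

    Note: with fnmatch, '*' also matches '/', so 'src/auth/*' covers
    nested directories as well.
    """
    return [p for p in touched
            if not any(fnmatchcase(p, pattern) for pattern in scope)]


def classify_commit(touched: Iterable[str], scope: List[str],
                    claimed_acs: Iterable[str],
                    verified_acs: Iterable[str]) -> str:
    """Turn the two silent failure modes into explicit outcomes."""
    if out_of_scope(touched, scope):
        return "RESET_WORKSPACE"  # drift: wrote outside the allowed paths
    if set(claimed_acs) - set(verified_acs):
        return "[BLOCKED]"        # claimed progress the diff does not show
    return "OK"


scope = ["src/auth/*", "tests/auth/*"]
print(out_of_scope(["src/auth/login.py", "src/billing/invoice.py"], scope))
print(classify_commit(["src/auth/login.py"], scope, ["AC-1", "AC-2"], ["AC-1"]))
```

Scope drift is checked first in this sketch because a commit that wrote outside its scope is not worth verifying against ACs at all.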

Five gaps we know we have

We are not done. Five known gaps in tamer today, in priority order:

Architecture invariants registry. A first-class place to declare invariants ("requests cannot be processed before tenant context is bound") that the Master checks before accepting any merge. Today this lives in reviewers' heads.
Spec completeness gate. A pre-assignment lint that blocks WIs with vague ACs ("works correctly") and requires concrete, falsifiable conditions.
Rejected designs log. Persistent memory of design proposals that were considered and rejected, so the next agent does not re-propose them.
Weekly trend metrics. Regression rate over time per project, per agent. SWE-CI showed more than 75% of agents trending worse; we want to be the platform that measures this on your codebase, not the one that hopes it isn't happening.
WI lint pre-assign. Static analysis of the WI itself: scope coherence, AC verifiability, dependency declarations. Junk WIs produce junk work; the lint catches the junk before the agent does.

This first article is a public commitment to ship these. We will write about each as it lands.
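To show what the spec completeness gate could look like at its simplest, here is a heuristic sketch. It is not an existing tamer feature. The phrase list, the "observable outcome" pattern, and the function names are assumptions, and a keyword heuristic like this will produce both false positives and false negatives.

```python
import re
from typing import List

# Assumed starter list of words that tend to mark an unfalsifiable AC.
VAGUE_PHRASES = ("works", "correctly", "properly", "as expected", "robust")

# Assumed signal that the AC names something observable: a verb such as
# "returns" or "raises", or any digit (status code, count, time limit).
OBSERVABLE_HINT = re.compile(
    r"\b(returns?|raises?|rejects?|increments?|equals?|exits?|within)\b|\d",
    re.IGNORECASE,
)


def lint_ac(statement: str) -> List[str]:
    """Return a list of problems; an empty list means the AC passes the lint."""
    problems = []
    lowered = statement.lower()
    for phrase in VAGUE_PHRASES:
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            problems.append(f"vague phrase: '{phrase}'")
    if not OBSERVABLE_HINT.search(statement):
        problems.append("no observable outcome named")
    return problems


def gate(acs: List[str]) -> bool:
    """Allow assignment only if there is at least one AC and none is flagged."""
    return bool(acs) and all(not lint_ac(ac) for ac in acs)


print(lint_ac("auth works"))
print(lint_ac("POST /login with bad password returns 401 "
              "and increments the rate-limit counter"))
```

Run against the two ACs quoted earlier in this post, the sketch flags "auth works" on both counts and lets the POST /login criterion through.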

Honest disagreement: spec without taste

The article we are responding to assumes the capability threshold has been reached. For domains where the model fails to reason at all — hard distributed systems, novel cryptography, performance work that requires reasoning about CPU caches — adding more spec does not help. It just produces more elaborately wrong code.

Spec without taste produces well-specified bad code. None of tamer's structural defenses substitutes for engineering taste. Taste is what tells you which invariant to register, which review comment to take seriously, which rejected design to remember. Tamer raises the floor — the worst-case output is more honest, more contained, more reviewable. It does not raise the ceiling.

The team that ships great software with tamer is the team that already had the taste to ship great software. We make their work cheaper. We do not replace the work.

What to do with this

If you are running coding agents on real codebases:

Read your spec like a model would. Where is it ambiguous?
Look at last quarter's PRs. How many were locally correct and globally drifting?
Decide what your invariants are before you let an agent touch them.
Get a second pair of eyes on every non-trivial change. Cheap when the eyes are another agent; expensive when they are an outage.

If you want a tool that bakes the four practices above into the workflow, tamer is free to self-host: BSL 1.1, no per-seat fee, no telemetry. The FAQ covers license, OS support, and how tamer compares to Cursor / Cline / Aider.

We publish one of these posts every two days. Subscribe to the feed if you want them as they land.

Footnote. SWE-CI: Software Engineering Continuous Integration benchmark, Alibaba. 18 AI agents × 100 real codebases × 233 days × 71 commits. The finding that regression rate increases over time for more than 75% of agents is the result we cite throughout. We came across this benchmark via Saghir's "Your coding agent is under-specified", which is the article this post responds to. We use it as a prompt for thinking, not a verdict: your mileage will vary by domain, codebase shape, and which agent you run. If you have data that contradicts this on your codebase, open a GitHub issue; we want to know.