Anatomy of a [DONE] event that never arrived
Four levels of root cause for why an agent that finishes its work stays invisible.
TL;DR
- A worker finishes its job and commits the change. The master never sees it.
- Three wrong hypotheses before the right one: "noisy stream," "preemptive 2," "detection-based fix."
- The actual root cause was three layers deeper: every MCP tool call in Claude Code is gated on first use by an interactive "Do you want to proceed?" prompt. The worker sat at that prompt forever.
- The fix lives in the worker's bootstrap, not in the master's listening logic. Knowing where the cause lives is half the work.
The symptom
A worker takes a Work Item, executes it, makes the commit, and replies to the master via reply_to_master(...). From the master's side: silence. For hours.
We knew the work was done. We could see the commit on disk. The master had no idea.
This is the story of what we thought was wrong, and how we found what was actually wrong.
Layer 1 — "I missed the event in the noise"
First instinct: the master's event stream is dropping worker_reply events. The fix that suggests itself is to poll: have the master periodically read the worker's git state, notice that a commit landed, and reconstruct the missing event from the diff.
We tried it. It works as a workaround — the master eventually catches up. But polling git is treating the symptom. The event was supposed to arrive, and it didn't, and we hadn't yet asked why. Polling gave us a system that limped instead of one that worked.
We backed it out. A workaround that obscures the underlying defect makes the next bug harder to find.
Layer 2 — Preemptive "2"
Next hypothesis: the worker was sitting at an approval prompt of some kind, and a single keystroke would dismiss it. Specifically, Claude Code's tool-use prompts default to option 2 ("yes, and don't ask again for this tool in this session"). If we send 2 to the worker the moment we connect, we should pre-dismiss whatever prompt was blocking it.
We sent 2 on connection. The worker, which was not at a prompt, received 2 as a chat message and replied: "I see '2' but no question."
Wrong on two counts. First, sending 2 to an idle worker isn't a no-op — it's a chat message that pollutes the conversation. Second, the timing was wrong: the prompt didn't exist yet at the moment we sent the keystroke, because the prompt is triggered by the worker's first MCP call, not by the master's connection.
False turn, tested, abandoned.
Layer 3 — Detection-based fix
Third hypothesis: instead of blindly sending 2, detect the prompt and respond to it. We built a watcher that polls agent_wait snapshots looking for the pattern Tool-use plus Do you want to proceed?. When the pattern fires, the watcher injects 2 into the worker's stdin.
This works. The worker gets unblocked, the master receives the event, the WI closes.
But every first MCP call still pays a 30-second-plus tax before the watcher fires. And — more important — the fix is reactive. We were reading prompts off the screen and typing replies to them. That is exactly what a human at the keyboard would do. Whatever it was, it was not structural.
A reactive fix that works masquerades as a structural fix. The test suite goes green and you stop digging. We almost did.
Layer 4 — The actual root cause
We kept digging. The structural fact is this: Claude Code gates every MCP tool call at first usage with an interactive "Do you want to proceed?" prompt. The worker types reply_to_master(...). Claude Code says "this tool is from an MCP server you haven't approved in this session — proceed?" The worker is now blocked on a prompt nobody is going to answer. The MCP call never executes. The master never sees a thing.
That is why polling git "fixed" it (the commit was on disk regardless), and why detecting the prompt "fixed" it (we were typing the answer the worker could not), and why preemptive 2 didn't (the prompt didn't exist yet when we tried).
The bug is not in the master's listening logic. It is not in the relay. It is not in reply_to_master. It is in the worker's bootstrap policy: MCP tools are gated by default, and a fresh worker hits the gate on its first call.
The fix
Pre-allowlist mcp__tamer-worker__* in the worker's .claude/settings.json during tamer init. The prompt never appears, the call goes through on first use, no interactive gate exists to stall the worker.
The allowlist block is a few lines in the settings template (templates/claude-settings.json:28-32). The init step now writes that block into every worker's .claude/ directory. The behaviour is covered by tests/init/init-mcp-allowlist.test.ts, 4/4 passing. The fix landed in tamer-cli at 0a322f9, with the related TODO sweep at 42e2289.
That's it. Five lines of configuration in the right file.
Where the fix lives matters
Notice where the repair is not: it is not in the master orchestrator, not in the event router, not in reply_to_master. It is in the worker's bootstrap. We had spent days assuming the bug was in the listening side because the symptom was on the listening side.
Half of debugging is locating where the cause lives. Once you know, the actual fix is often trivial. We wrote about an adjacent failure mode in Spec is the bottleneck, not the model — code that is locally consistent and globally drifting. This was the same shape of mistake applied to debugging: every layer was locally consistent. The symptom matched the hypothesis, the workaround worked, the metric went green. The cause was three layers deeper than the layer we were patching.
Meta-lesson
Three wrong hypotheses before the right one. None of them were stupid; each one had evidence behind it. The lesson is not "be smarter," because we were each time genuinely solving the version of the problem we thought we had.
The lesson is: keep digging even when it seems fixed. A reactive fix that works masquerades as a structural fix until you find what it was masking. If you only stop when the test goes green, you stop at Layer 3 forever.
The harder lesson is to trust where the cause lives. We kept editing the master because the symptom was visible there. The cause was in a config file in the worker. The shortest distance between symptom and cause is rarely a straight line.
Closing
If you are running coding agents and seeing similar "completed work that never reports back" symptoms — the worker did the thing, you can see it on disk, but the orchestrator is blind — check whether your agent has interactive prompts gating its tool calls. Pre-allowlist the tools at bootstrap. An interactive prompt is not your friend inside an unattended worker.
See how tamer silently observes its workers — without typing 2 at any prompts — in the FAQ.
We ship one of these every two days. Subscribe to the feed if you want them as they land.
Footnote. Commit references: tamer-cli 0a322f9 (the Layer-4 fix — settings template plus init wire-up) and 42e2289 (TODO sweep tracking the resolved follow-ups). Test coverage: tests/init/init-mcp-allowlist.test.ts, 4/4 passing on the current main. Settings template anchor: templates/claude-settings.json:28-32 is the allowlist block written to every fresh worker's .claude/settings.json at tamer init time. If your worker is a different agent — Aider, Cursor, Cline — the equivalent gate exists under a different name, but the fix shape is the same: pre-allow at bootstrap, do not negotiate the prompt at runtime.