The Harness Is the Product

The useful layer around coding agents is becoming the harness: the environment that lets an agent do work, check itself, recover from failure, and hand something reviewable back to a human or another system. The model is still important, but the repeatable advantage is moving into the surrounding machinery.

The same pattern shows up across recent agentic development work:

security workflows packaged as agent tools;
review swarms that split behavioral regression, security, performance, and test coverage into separate checks;
teams arguing that evals should come before product specs, not after implementation;
code review moving from one human opinion at the end toward gates, tests, rubrics, and multi-model critique during the work;
skills turning repeated expert judgment into portable procedures an agent can invoke on demand.

The common thread is not “better prompting.” It is a shift from conversational coding to operational coding.

Prompts Do Not Hold Shape

A prompt can start work, but it does not reliably preserve standards across a long task.

Long agent runs drift. They fill context with stale details. They overfit to the last error. They forget why a constraint existed. They take shortcuts if the environment lets them. A stronger prompt helps, but it is still just text in a session.

A harness gives the work shape outside the conversation:

source control boundaries;
tests and lint commands;
structured review prompts;
app-specific skills;
simulator or browser automation;
permission rules;
known failure-mode checks;
artifact requirements;
handoff formats.

That outside structure is what lets the agent operate longer without silently turning the codebase into a pile of plausible changes.
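
As one concrete case, a source control boundary can live in a script instead of in the prompt. The sketch below is a minimal illustration, assuming the harness already has the list of paths an agent changed. The names `ALLOWED_GLOBS`, `PROTECTED_GLOBS`, and `out_of_bounds`, and the example patterns, are hypothetical and not taken from any particular tool.

```python
from fnmatch import fnmatch

# Hypothetical policy: patterns the agent may edit, and patterns it may never edit.
# Note that fnmatch's "*" also matches "/", so "src/*" covers nested files.
ALLOWED_GLOBS = ["src/*", "tests/*", "docs/*.md"]
PROTECTED_GLOBS = ["src/billing/*", "*.lock"]


def out_of_bounds(changed_paths, allowed=ALLOWED_GLOBS, protected=PROTECTED_GLOBS):
    """Return the changed paths that violate the boundary, in input order."""
    violations = []
    for path in changed_paths:
        is_allowed = any(fnmatch(path, pattern) for pattern in allowed)
        is_protected = any(fnmatch(path, pattern) for pattern in protected)
        # A protected match wins even when an allowed pattern also matches.
        if is_protected or not is_allowed:
            violations.append(path)
    return violations
```

In practice the path list might come from something like `git diff --name-only`. The rule is enforced the same way on the first change and the five-hundredth, regardless of what the session remembers.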

The Best Agent Workflows Are Not Single-Agent Workflows

The more serious patterns split responsibility.

One agent implements. Another reviews. Another checks security. Another inspects test coverage. A local script enforces formatting. CI decides whether the branch can merge. A human reviews the summary and the diff, not a transcript of every token.

This is not delegation for its own sake. It is accountability.

Subagents are useful when they create separate lines of scrutiny. They are much less useful when they are just more parallel enthusiasm pointed at the same vague task.

A good harness answers:

What is the agent allowed to touch?
What evidence proves the work is done?
Which checks are mandatory?
What should happen when checks fail?
Which risks need independent review?
What artifact should a human inspect?

Without those answers, “use more agents” mostly creates more output.
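
One way to keep those answers from staying implicit is to write them down as data the harness can act on. The following is a rough sketch under stated assumptions: `HarnessSpec`, its field names, and `verdict` are hypothetical, invented here to mirror the questions above, and do not describe any existing framework.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessSpec:
    """Hypothetical record of a harness's answers to the questions above."""
    writable_paths: list      # what the agent is allowed to touch
    done_evidence: list       # what evidence proves the work is done
    mandatory_checks: list    # which checks are mandatory
    on_failure: str           # what should happen when checks fail
    independent_reviews: list = field(default_factory=list)  # risks needing separate review
    human_artifact: str = "summary and diff"                  # what a human inspects


def verdict(spec, check_results):
    """Decide the next step from check results (a dict of check name -> bool).

    A mandatory check that is missing from the results counts as failed,
    so a run cannot pass by skipping a check.
    """
    failed = [name for name in spec.mandatory_checks if not check_results.get(name, False)]
    if failed:
        return {"status": "blocked", "failed": failed, "next": spec.on_failure}
    return {"status": "ready", "failed": [], "next": "hand off " + spec.human_artifact}
```

The design choice worth noting is that a missing result is treated as a failure. Adding a second agent to this setup adds a line of scrutiny only if its check appears in `mandatory_checks` or `independent_reviews`.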

Skills Are Procedures With Distribution

Skills are one of the cleaner packaging formats for harness work.

A good skill is not just a prompt. It is a compact procedure: when to activate, what context to gather, what tools to use, what standards to enforce, and what result to produce.

That matters because teams do not need one perfect prompt for every task. They need reusable procedures for recurring work:

generate project documentation;
audit frontend quality;
review SwiftUI structure;
produce App Store screenshots;
threat-model a change;
turn a feature idea into a testable plan;
clean up agent-introduced complexity.

Once a procedure is packaged, it can be improved, shared, tested, and reused. The knowledge stops living only in someone’s chat history.
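
To make the "compact procedure" idea concrete, here is a minimal sketch of a skill as a structured record. Everything in it is an assumption for illustration: the `Skill` fields simply mirror the five parts named above, `THREAT_MODEL` is an invented example, and the keyword matching in `matching_skills` stands in for whatever activation mechanism a real agent runtime uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical shape for a packaged procedure; field names are illustrative."""
    name: str
    triggers: tuple    # when to activate: phrases matched against the task text
    context: tuple     # what context to gather before starting
    tools: tuple       # what tools the procedure may use
    standards: tuple   # what standards to enforce
    output: str        # what result to produce


def matching_skills(task, skills):
    """Return names of skills whose triggers appear in the task text (case-insensitive)."""
    text = task.lower()
    return [s.name for s in skills if any(t.lower() in text for t in s.triggers)]


# Invented example skill, corresponding to "threat-model a change" in the list above.
THREAT_MODEL = Skill(
    name="threat-model-change",
    triggers=("threat model", "security review"),
    context=("diff of the change", "list of external inputs"),
    tools=("read-only repository access",),
    standards=("every new input has a named trust boundary",),
    output="short report listing risks and mitigations",
)
```

Because the record is plain data, it can be versioned, reviewed, and tested like any other file in the repository.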

Review Is Becoming a Gate, Not a Comment Thread

Human review is still necessary, but it should not be the first serious verification step.

Agent-produced code changes the economics of review. The volume goes up. The confidence of the prose is not evidence. The diff may be large, mechanically consistent, and wrong in a subtle way.

The useful response is not to read more carefully forever. It is to put more checks before the human:

automated tests;
type checks;
lint and formatting;
security scans;
browser or simulator walkthroughs;
multi-pass code review;
explicit regression checks;
short human-readable summaries of what changed and what was verified.

The human reviewer should be judging evidence, not discovering from scratch whether the agent remembered to run the obvious checks.
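
A small sketch of what "checks before the human" can look like as code, assuming each gate is a command whose exit status signals pass or fail. The function names `run_gates` and `summarize` are hypothetical, and the commands in `EXAMPLE_GATES` are placeholders that only exit with a fixed status; a real harness would substitute its own test, lint, and scan commands.

```python
import subprocess
import sys


def run_gates(gates):
    """Run each (name, command) gate in order and record whether it passed.

    All gates run even after a failure, so the summary shows the full picture.
    """
    results = []
    for name, command in gates:
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((name, completed.returncode == 0))
    return results


def summarize(results):
    """Render a short, human-readable verification summary."""
    lines = [f"{'PASS' if ok else 'FAIL'}  {name}" for name, ok in results]
    outcome = "ready for human review" if all(ok for _, ok in results) else "blocked"
    return "\n".join(lines + [f"verdict: {outcome}"])


# Placeholder commands: each just exits with a fixed status so the sketch runs
# anywhere a Python interpreter is available.
EXAMPLE_GATES = [
    ("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("lint", [sys.executable, "-c", "raise SystemExit(1)"]),
]

if __name__ == "__main__":
    print(summarize(run_gates(EXAMPLE_GATES)))
```

The summary is the artifact the reviewer reads first: which checks ran, which failed, and whether the change reached them at all.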

What This Means Practically

If you are building with agents, the question is not just “which model should we use?”

Ask instead:

What tasks do we repeat often enough to package as skills?
What checks should always run before a change is considered done?
Where do agents keep making the same class of mistake?
Which parts of the workflow need isolated review?
What context should be retrieved fresh instead of trusted from memory?
What permissions should agents never have by default?

That is the work.

The frontier will keep moving. Models will get better. Editor integrations will change. But the compounding advantage is in the harness: the pieces of process, context, verification, and tooling that make each agent run more reliable than the last one.

The model can be rented.

The harness is what you own.
