Notes on harness engineering

Forward deployed engineer (FDE) was not really a role a few years ago. Now every frontier lab has them.

My prediction: harness engineer is next (at least until AGI comes).

I have been obsessed with this for the last few weeks, getting my hands dirty with a bunch of these harnesses. Some notes.

Getting the best performance out of an AI model for agent work comes from two places. Good agent engineering is one. Building a better harness for your specific model is the other.

What is a harness

I heard this in an OpenAI engineer’s talk: going forward, you need to shift your framing a bit, from “how do I use AI to write better code” to “how do I help AI write better code.” The harness is one thing that helps answer the second.

Reread that one. It took me a while to fully internalize.

A great harness helps AI do better work.

“Code” here is a stand-in. Replace it with whatever knowledge work you want the agent to do.

First-party vs third-party

There are a few ways to slice harnesses. Let’s start with first-party vs third-party.

Third-party harnesses are model-agnostic. Example: the Copilot SDK/CLI supports different models. Codex can also run non-OpenAI providers through the Responses API. Useful if you want to swap providers.
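
To make "model-agnostic" concrete, here is a minimal sketch of what that shape tends to look like in code. Everything in it is hypothetical (`ModelProvider`, `EchoProvider`, `run_turn`, the system prompt); it is not the API of the Copilot SDK or any real harness. The point is only that the harness talks to an interface, so the provider can be swapped without touching the loop.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    role: str
    content: str


class ModelProvider(Protocol):
    """Hypothetical interface: anything that can turn messages into a reply."""

    def complete(self, messages: list[Message]) -> str: ...


class EchoProvider:
    """Stand-in provider for illustration; a real one would call a vendor API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, messages: list[Message]) -> str:
        # Echo the last message so the example runs without network access.
        return f"[{self.name}] {messages[-1].content}"


def run_turn(provider: ModelProvider, user_input: str) -> str:
    """One harness turn: the same prompt and plumbing for every provider."""
    messages = [
        Message("system", "You are a coding agent."),  # assumed shared prompt
        Message("user", user_input),
    ]
    return provider.complete(messages)


if __name__ == "__main__":
    # Swapping providers is a one-line change; nothing else moves.
    print(run_turn(EchoProvider("provider-a"), "list the failing tests"))
    print(run_turn(EchoProvider("provider-b"), "list the failing tests"))
```

Note what the sketch gives up: one shared system prompt and one shared set of plumbing for every model. That is the optionality you are paying for.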

But!

The real performance comes from running a model inside its first-party harness. Claude inside Claude Code. GPT-5.x inside Codex.

The harness is the gym

Why? Here is the fine print. A lot of the post-training that makes these models good at agent work happens inside the provider’s own harness. The harness is the gym.

You can see this directly in the system prompts. For example:

Codex tells its model: “prefer using rg or rg --files because rg is much faster than alternatives like grep.”
Claude Code tells its model the opposite: “Content search: Use Grep (NOT grep or rg).”

Same task, opposite philosophy. In general: Codex routes through the shell. Claude Code routes through dedicated tools.
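
A rough sketch of the two shapes, to make the difference tangible. This is my own illustration, not code from Codex or Claude Code; `shell_tool` and `grep_tool` are hypothetical names. In the first shape the harness exposes one generic shell tool and the model writes the command itself (for example an `rg` call). In the second, the harness owns a dedicated search tool with typed parameters.

```python
import re
import subprocess
from pathlib import Path


def shell_tool(command: str, cwd: str = ".") -> str:
    """Shell-shaped harness: one generic tool, the model composes the command."""
    result = subprocess.run(
        command, shell=True, cwd=cwd, capture_output=True, text=True, timeout=30
    )
    # Return both streams so the model can see errors as well as output.
    return result.stdout + result.stderr


def grep_tool(pattern: str, root: str = ".", glob: str = "*") -> list[str]:
    """Tool-shaped harness: a dedicated search tool with typed parameters."""
    regex = re.compile(pattern)
    hits: list[str] = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path}:{lineno}:{line}")
    return hits


if __name__ == "__main__":
    # Shell shape: the model would emit the whole command string.
    print(shell_tool("echo searching via the shell"))
    # Tool shape: the model would emit structured arguments instead.
    for hit in grep_tool(r"def \w+_tool", root=".", glob="*.py"):
        print(hit)
```

A model trained against one shape has seen thousands of rollouts of exactly that kind of call. Hand it the other shape and it has to improvise.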

Each model was trained to be good inside its own shape. Pull one out of its gym and drop it into a third-party harness, and it will still work. But you are getting a model outside the gym it was trained and optimized in.

The builder take

Today, imo, the leading harnesses are Codex and the Anthropic SDK. Copilot SDK is not bad either. I have used all three. Dobby, my personal assistant, performs best on first-party.

So what does all this yapping amount to? If you are building a harness, build. But pick your shape.

If you are committed to one model, build on top of its first-party harness. That is where the performance lives. The trap is reaching for a third-party agnostic harness by default. You trade real performance for optionality you will probably never use.

On picking a vertical. General knowledge work is a huge vertical, and the big providers are going to eat it. Look at their release pace. It is insane. Any general-purpose thing you build is six months from being absorbed into the next Codex or Claude Code release. The move is to pick a vertical niche enough that the Eye of Sauron does not swing your way. That is where a small team can still build something that sticks.

My goal this year: ship one harness on top of a first-party platform, for one specific vertical, open source, a hundred GitHub stars by end of year. I am very likely to fail.

More as I keep building.