
What Is an Agent Harness, Really

The harness is the invisible layer between you and the AI model: tools, memory, permissions, and context management that turn a chatbot into a capable agent.

AI
March 29, 2026
13 min

Every time I open Claude Code and ask it to refactor a module, something quietly remarkable happens. The model reads my files. It searches through my codebase. It edits code, runs tests, checks the output, and circles back if something breaks. It remembers instructions I wrote in a markdown file three sessions ago.

None of that is the language model.

The model — Claude, GPT, whatever — is a text-in, text-out machine. It can't read your filesystem. It can't execute shell commands. It has no memory between sessions. Left on its own, it's a very smart chatbot sitting in a void, waiting for you to paste things into the prompt and copy things out.

So what's doing all the rest?

That's the harness. And understanding it changed how I think about every AI product I use.


The Term Nobody Used Until Recently

I didn't know this had a name until February 2026, when Mitchell Hashimoto — co-founder of HashiCorp — published a blog post about his AI workflow and used the phrase "Engineer the Harness." Days later, OpenAI published a large-scale practical report on their internal harness architecture. Within weeks, the term was everywhere.

The metaphor comes from horse tack — reins, saddle, bit — the complete set of equipment for channeling a horse's power in the right direction, preventing runaways, and enabling stable long-distance operation. It's a good image. The horse is powerful but it needs the harness to be useful.

Multiple people converged on the same definition independently:

Philipp Schmid offers the clearest analogy I found. The model is the CPU. The context window is RAM. The harness is the operating system. The OS doesn't do the computation — it manages everything around it so the computation is useful.

That's what clicked for me. Two products can use the exact same underlying model, but the one with a better harness — better tool support, better memory, better guardrails — delivers a vastly different experience. The model is table stakes. The harness is where products differentiate.


Five Components, Every Single Time

I kept digging into different harness implementations — Claude Code, Cursor, Devin, OpenHands, various frameworks — and the same five components kept showing up. The implementations differ wildly, but the categories are universal.

1. Tools: How the Model Acts

Without tools, an LLM produces text and nothing else. Tools are the mechanisms that let it touch the real world — read files, execute commands, search the web, edit code.

What surprised me is that fewer tools often work better. Vercel removed 80% of their agent's tools and got better results — fewer steps, fewer tokens, faster responses, higher success rates. Claude Code takes this to heart: it uses a small set of primitive capabilities — Read, Write, Bash, and MCP connections — rather than hundreds of specialized integrations. Bash alone is what Anthropic calls "the universal adapter." If a human can do it from the terminal, the agent can too.

The design tension is granularity. Too many tools overwhelm the model with choices. Too few and it can't do specific tasks well. The production pattern I see converging: start with powerful primitives, add specialized tools only when primitives consistently fail at something, and dynamically scope which tools are available based on the task phase.
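
To make that pattern concrete, here is a minimal sketch of phase-based tool scoping in Python. The tool names and phases are my own illustration, not any product's actual API:

```python
# Sketch: scoping which tools the model sees by task phase.
# Tool names and phase labels are illustrative assumptions.

PRIMITIVES = ["read_file", "write_file", "bash", "grep"]

PHASE_TOOLS = {
    "explore": ["read_file", "grep"],               # read-only while gathering context
    "edit":    ["read_file", "write_file", "bash"],
    "verify":  ["read_file", "bash"],               # run tests, inspect output
}

def tools_for_phase(phase: str) -> list[str]:
    """Return the tool subset exposed to the model in this phase."""
    return PHASE_TOOLS.get(phase, PRIMITIVES)

print(tools_for_phase("explore"))  # ['read_file', 'grep']
```

The point of the lookup is that the model's decision space shrinks to what the current phase actually needs, rather than every tool competing for attention on every turn.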

2. Permissions: What the Model Cannot Do

This is the part I underestimated. Every tool is a capability, and every capability is a risk surface. The permission layer decides what the agent is allowed to do, and when it needs to ask.

Claude Code implements tiered trust: Plan mode is read-only. Default mode asks before every file edit and shell command. Auto-accept mode lets file edits through but still gates shell commands. Auto mode (still in research preview, as of March 2026) runs background safety checks on everything.
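
A toy version of that tiering might look like the following. The modes and rules are loosely modeled on the tiers above, not Claude Code's actual implementation:

```python
# Sketch of tiered trust: each mode decides whether a tool call runs,
# asks the user, or is blocked. Mode names and rules are illustrative.

def gate(mode: str, tool: str) -> str:
    """Return 'allow', 'ask', or 'deny' for a tool call under a trust mode."""
    if mode == "plan":                        # read-only: mutations denied outright
        return "allow" if tool in ("read", "grep") else "deny"
    if mode == "default":                     # ask before any edit or shell command
        return "ask" if tool in ("write", "bash") else "allow"
    if mode == "auto_accept":                 # edits pass, shell still gated
        return "ask" if tool == "bash" else "allow"
    return "ask"                              # unknown mode: fail safe

print(gate("plan", "write"))         # deny
print(gate("auto_accept", "write"))  # allow
```

Note the last line: an unrecognized mode falls back to asking, because a permission layer should fail closed, not open.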

The OpenDev paper describes defense-in-depth with five independent safety layers — prompt-level guardrails, schema-level restrictions, runtime approval, tool-level validation, and lifecycle hooks. The strongest pattern across all implementations: subagent isolation, where specialized agents never see tool definitions they cannot use. The LLM literally doesn't know the dangerous tools exist.

Constraints increase reliability. Every team that succeeded did so by restricting what agents could do, not by giving them more freedom. That's counterintuitive but consistent across every source I read.

3. Context Management: The Central Constraint

This is the one that made me realize harness engineering is actually systems engineering. The context window is finite — every instruction, every tool definition, every conversation turn, every file the model reads occupies space. Managing that space is the single biggest architectural driver.

Across production systems, the same strategies keep recurring: compacting older turns into summaries, truncating tool output before it enters the window, and offloading exploration to sub-agents that spend their own context instead of the parent's.

One finding from Anthropic that blew my mind a little: they discovered "context anxiety" — models start wrapping up tasks prematurely as they approach perceived context limits. The fix wasn't better summarization. It was full context resets with structured handoffs. Clear the slate, give the model a clean brief of where things stand, and let it continue fresh.
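
A minimal sketch of that reset-with-handoff pattern, with field names I made up for illustration:

```python
# Sketch: a full context reset with a structured handoff, as opposed to
# summarizing in place. The state fields are illustrative assumptions.

def handoff_brief(state: dict) -> str:
    """Compress working state into a brief that seeds a fresh context."""
    return (
        f"Goal: {state['goal']}\n"
        f"Done: {', '.join(state['done'])}\n"
        f"Next: {state['next_step']}\n"
        f"Constraints: {state['constraints']}"
    )

def reset_context(state: dict) -> list[dict]:
    """Start a clean message list containing only the handoff brief."""
    return [{"role": "user", "content": handoff_brief(state)}]

messages = reset_context({
    "goal": "refactor auth module",
    "done": ["extracted token helper", "updated imports"],
    "next_step": "run test suite and fix failures",
    "constraints": "no public API changes",
})
```

The new session starts with one short message instead of a long compressed history, which is exactly what removes the perceived context pressure.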

4. Execution Loop: Think, Act, Observe, Repeat

At the core of every harness is a loop. The simplest version: call the model with available tools, let it decide what to do, execute the tool call, feed the result back, repeat until the model says it's done.

That's it. The orchestrator doesn't need to understand code or files — it just runs the loop and lets the model decide when to stop.

Claude Code structures this as three overlapping phases: gather context, take action, verify results. The user is part of the loop — you can interrupt at any point to steer. More sophisticated implementations add pre-checks, self-critique, and verification steps, but the fundamental pattern is always Think-Act-Observe-Repeat.

5. Memory: Persistence Across Sessions

A model without memory starts from zero every time you open a new session. The harness adds persistence.

Claude Code implements a six-layer memory system: organization policies, project-level conventions (CLAUDE.md), user preferences, session context, auto-learned patterns (MEMORY.md), and task tracking. At session start, these layers load in order, giving the model a full picture before the first message.
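
In code, layered loading is just reading files in precedence order. A sketch covering four of the six layers; the paths other than CLAUDE.md and MEMORY.md are hypothetical:

```python
# Sketch: layered memory loading at session start. Broader layers load
# first so narrower ones can override. Paths are illustrative; only
# CLAUDE.md and MEMORY.md are real conventions named in the docs.

from pathlib import Path

MEMORY_LAYERS = [
    "/etc/agent/org-policy.md",   # organization policies (hypothetical path)
    "./CLAUDE.md",                # project-level conventions
    "~/.agent/preferences.md",    # user preferences (hypothetical path)
    "./MEMORY.md",                # auto-learned patterns
]

def load_memory(layers=MEMORY_LAYERS) -> str:
    """Concatenate whichever memory files exist, in precedence order."""
    parts = []
    for layer in layers:
        p = Path(layer).expanduser()
        if p.exists():
            parts.append(f"# {p.name}\n{p.read_text()}")
    return "\n\n".join(parts)

system_prompt = load_memory()  # prepended before the first user message
```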

This is the least glamorous component and the one that matters most in practice. The difference between an agent that remembers your coding conventions and one that reformats your code every session is entirely about memory design.


Dissecting Claude Code: A Worked Example

I use Claude Code daily, so I wanted to dissect it as a concrete case. It's also the best-documented production harness I've found — Anthropic publishes both official architecture docs and engineering blog posts about design decisions.

The design philosophy is captured in one line from their docs: "One main loop, simple search, simple todolist. Resist the urge to over-engineer, build good harness for the model and let it cook."

Three principles stand out:

Primitive tools over specialized integrations. Rather than building a dedicated "refactor function" tool, a "run test" tool, a "search codebase" tool — Claude Code gives the model Read, Write, Bash, Grep, Glob, and an Agent tool for delegation. Everything else composes from these. Need to run tests? Bash. Need to check types? Bash. Need to search for all callers of a function? Grep. This keeps the tool set small and the model's decision space manageable.

Build to delete. This is the most counterintuitive principle. Every piece of hard-coded logic is a liability when the next model ships. Anthropic found that improvements in Opus 4.6 made sprint decomposition unnecessary — so they deleted it. The architecture is designed to shrink over time. Harness components should be modular enough to rip out when the model outgrows them.

Sub-agents for context isolation. When Claude Code needs to explore a codebase, it spawns an Explore sub-agent — a cheap model with read-only access running in its own context window. For planning, a full-power model with read-only access. For complex sub-tasks, a general-purpose sub-agent with all tools. Each runs independently; only summaries return to the parent. This is how one session can handle a task that would blow any single context window.
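
The isolation boundary is easy to see in code. A sketch, where call_model stands in for any LLM API and the reply shape is my own assumption:

```python
# Sketch: sub-agent context isolation. The child gets its own message
# list and a restricted tool set; only a summary crosses back to the
# parent. `call_model` and the reply dict shape are assumptions.

def run_subagent(task: str, tools: list[str], call_model) -> str:
    """Run a task in a fresh context; return only a summary string."""
    messages = [{"role": "user", "content": task}]   # isolated context window
    while True:
        reply = call_model(messages, tools=tools)    # child never sees parent tools
        if reply.get("done"):
            return reply["summary"]                  # only this crosses the boundary
        messages.append({"role": "assistant", "content": reply["text"]})

# The parent appends just the summary, not the child's full transcript:
# parent_messages.append({"role": "tool", "content": summary})
```

Whatever the child read along the way (thousands of lines of explored code, failed attempts, dead ends) never enters the parent's window.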

The extensibility model layers on top: Skills (reusable instruction bundles), Hooks (deterministic scripts at lifecycle events), MCP (standard protocol for external services), and Agent Teams (multiple instances coordinating via shared task lists). But underneath, it's always the same five components: tools, permissions, context, loop, memory.


Same Model, Different Harness, Different Product

Here's what drove the point home for me. Multiple products use the same Claude or GPT models but produce completely different experiences. The difference is the harness.

Cursor bets on context quality. When you open a project, Cursor analyzes every file, splits them into chunks, computes embedding vectors, and stores them in a vector database. This means the agent can find semantically relevant code without writing search queries — the harness does the retrieval. Their proprietary Composer model was trained via reinforcement learning inside real codebases. It's an IDE-first harness: the tools are tightly integrated with the editor, not the terminal.

Devin bets on autonomy. Each Devin instance gets a full isolated virtual machine — browser, terminal, code editor — in a cloud sandbox. You assign a task and walk away. As of March 2026, Devin can orchestrate other Devins in parallel, each in its own VM. Their reported stat: 67% of PRs merged, up from 34% a year ago. The tradeoff: maximum autonomy at the cost of human oversight. You can't interrupt mid-thought like you can with Claude Code.

OpenHands bets on openness. Their architecture uses an event-sourced state model — every action and observation is an immutable event appended to a log, enabling deterministic replay. It's open-source (64k+ GitHub stars as of March 2026), with a composable SDK and MCP integration. If you want to build your own harness on top of a well-tested foundation, this is where to start.

Frameworks like LangChain/LangGraph and Microsoft's Agent Framework take a different approach entirely — they provide building blocks rather than complete products. You assemble the five components yourself. Maximum flexibility, but you own the integration work.

Same model, different harness, different product. That's the thesis in action.


MCP: The Universal Plug

One development worth calling out: the Model Context Protocol (MCP) is standardizing the tool layer across all harnesses.

Before MCP, every harness had to build custom integrations for every external service — a GitHub integration, a database integration, a Slack integration, each built from scratch. MCP creates a universal protocol: build a tool server once, connect it to any MCP-compatible harness.

The adoption trajectory has been fast. Anthropic introduced MCP in November 2024. OpenAI adopted it by March 2025. Anthropic donated it to the Linux Foundation in December 2025, co-founded with Block and OpenAI. As of early 2026: 10,000+ active MCP servers, 97 million monthly SDK downloads. OpenAI is sunsetting the Assistants API in mid-2026, pushing everyone toward MCP-based architectures.

For harness design, MCP changes three things: the tool layer becomes pluggable (connect to servers, don't build integrations), context is optimized (deferred loading means only tool names consume context until used), and tools become composable across products (build once, use everywhere).

If you're building a harness today, MCP is the tool layer. That question is basically settled.


Building Your Own: Start With 50 Lines

This is the part I found most encouraging. A minimal harness is not complicated.

Simon Willison puts it simply: "A simple tool loop can be achieved with a few dozen lines of code on top of an existing LLM API." The core is a while loop:

while True:
    response = llm.chat(messages, tools=tool_schemas)
    if not response.tool_calls:   # no tool calls means the model is done
        break
    for call in response.tool_calls:
        result = execute_tool(call)              # run the tool on the model's behalf
        messages.append(tool_result(result))     # feed the output back into context

That's a harness. Not a good one, but a real one. From there, you add layers as real failures demand them: permissions (validate before executing), output truncation (don't blow the context window), compaction (summarize old turns), memory (load instructions from files), and eventually sub-agents (delegate to isolated contexts).
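
As a sketch, bolting the first two of those layers onto the bare loop might look like this, with llm.chat, execute_tool, and tool_result as the same assumed stand-ins:

```python
# Sketch: the bare loop plus two layers, a permission check before
# execution and output truncation to protect the context window.
# All callables passed in are assumed stand-ins, not a real API.

MAX_OUTPUT = 4000  # characters of tool output to keep (illustrative limit)

def truncate(text: str, limit: int = MAX_OUTPUT) -> str:
    return text if len(text) <= limit else text[:limit] + "\n[...truncated]"

def run(llm, messages, tool_schemas, is_allowed, execute_tool, tool_result):
    while True:
        response = llm.chat(messages, tools=tool_schemas)
        if not response.tool_calls:
            break
        for call in response.tool_calls:
            if not is_allowed(call):                        # permission layer
                messages.append(tool_result("denied by policy"))
                continue
            result = execute_tool(call)
            messages.append(tool_result(truncate(result)))  # truncation layer
```

Each layer is a few lines, added only once a real failure demanded it, which is the whole method.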

Mitchell Hashimoto's practical method resonated most with me. Two strategies: first, an AGENTS.md file documenting behavioral mistakes and their solutions — "each line in that file is based on a bad agent behavior, and it almost completely resolved them all." Second, programmed tools that prevent specific failure modes.

The anti-pattern he flags: "Pre-designing ideal configurations before real failures occur is the #1 anti-pattern." Start with the bare loop. Deploy it. Watch what breaks. Fix that specific thing. Repeat.

That's harness engineering at its most honest — not building a framework, but iteratively solving real problems.


Where This Is Heading

Two trends I see accelerating:

Multi-agent composition. Single-agent harnesses hit context and complexity limits. The solution: agents that spawn other agents, coordinate through shared task lists, or hand off to specialists. Claude Code already does this natively. Devin orchestrates fleets of Devins. Anthropic's three-agent system — Planner, Generator, Evaluator — separates doing from judging, which turns out to be far more tractable than making a generator critical of its own work.
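
A toy sketch of that separation, with the three roles as plug-in callables (my own framing, not Anthropic's code):

```python
# Sketch: separate doing from judging. A planner decomposes the task,
# a generator produces work, and an evaluator judges it. The three
# callables are assumptions standing in for model calls.

def run_pipeline(task, plan, generate, evaluate, max_rounds=3):
    """Plan once, then loop generate -> evaluate until accepted."""
    steps = plan(task)
    results = []
    for step in steps:
        draft = generate(step)
        for _ in range(max_rounds):
            verdict = evaluate(step, draft)   # separate judge, fresh eyes
            if verdict["ok"]:
                break
            draft = generate(step + "\nFix: " + verdict["feedback"])
        results.append(draft)
    return results
```

The evaluator never generates and the generator never judges, which is what makes the critique honest.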

The discipline is formalizing. Martin Fowler's Thoughtworks team calls it "harness engineering." Academic papers are formalizing the architecture. MCP is standardizing the tool interface. We're watching a new engineering discipline crystallize — and it's moving fast.

The shift for developers is real. Multi-agent orchestration moves the role from "write code" to "decompose work, write clear specs, verify output." Understanding the harness is step one of that transition.


What I Know Now

Here's where I've landed.

The agent harness is the most consequential layer in AI systems today, and it's the one nobody talks about. Every time you use Claude Code, Cursor, Devin, or any AI coding tool, you're experiencing the harness more than the model. The tools, the permissions, the context management, the execution loop, the memory — that's what makes the interaction work.

The model provides intelligence. The harness makes that intelligence practical.

If you're building with AI agents, understanding the harness pattern — the five components, the design tradeoffs, the build-to-delete philosophy — gives you a framework for evaluating every product and building your own.

That's what I've pieced together. Hoping it saves you a few rabbit holes.
