Harness Engineering for AI Agents · The Harness Thesis

The Harness Thesis

12 min read

By the end of this reading you will be able to:

State the harness thesis and explain the empirical evidence from four randomized experiments that supports it
Distinguish between harness-side and model-side sources of quality improvement and estimate their relative magnitudes
Apply the seven harness engineering design principles to evaluate a proposed agentic system design

The Central Question

Can open-source models running locally approach the utility of frontier models through harness engineering alone?

The question sounds like the kind of thing people debate on forums. In this course, it is an empirical question with a quantified answer drawn from four controlled experiments and approximately 1,500 logged runs.

The answer: yes, for a large class of knowledge synthesis tasks. But the mechanism is not what most practitioners expect.

Why People Get This Wrong

The default mental model of agentic AI quality goes something like:

quality ≈ f(model_capability)

Under this model, the path to better agents is always the same: upgrade the model. Pay for GPT-4o instead of GPT-3.5. Wait for the next release.

This mental model is wrong in a precise way. Across four randomized experiments covering 36 controlled runs:

Upgrading the producer model (7B → 32B) improved mean composite score by ~0.9 points on a 0–10 scale (~10–12%).
Upgrading the harness (adding a count constraint, a rubric-based evaluation loop, and task-type-specific criteria) improved mean composite score by ~1.8 points (~20–25%).
Upgrading the evaluator model (9B → 30B) improved score by ~1.2 points (~15%).

The harness changes — changes to the pipeline, not the model — produced the largest single improvement. And unlike model upgrades, harness changes are free, local, version-controllable, and composable.

The revised mental model:

quality ≈ f(harness_design) + g(model_capability)

where the harness term dominates.
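
To put the three reported effects on a common footing, the sketch below divides each one by the producer-upgrade effect. The point values are the approximate figures quoted in this reading; the dictionary keys are labels chosen for this example, not names taken from the experiments.

```python
# Approximate mean composite-score gains (0-10 scale) quoted in this reading.
# The key names are illustrative labels, not identifiers from the experiments.
gains = {
    "producer_upgrade_7b_to_32b": 0.9,
    "harness_upgrade": 1.8,
    "evaluator_upgrade_9b_to_30b": 1.2,
}

reference = gains["producer_upgrade_7b_to_32b"]

# Express each gain as a multiple of the producer-upgrade gain.
relative = {name: round(gain / reference, 2) for name, gain in gains.items()}

for name, ratio in sorted(relative.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gains[name]:.1f} points, {ratio:.2f}x the producer upgrade")
```

On these figures the harness change is worth roughly twice the producer upgrade. The ratios are only as precise as the rounded point estimates they are computed from.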

What Is the Harness?

The harness is everything that is not the model. It has seven subsystems, and their initials spell HARNESS:

Letter	Subsystem	What it does	Module
H	Hooks	Skills injected at four pipeline points: pre_research, pre_synthesis, post_synthesis, post_wiggum	M4
A	Agent	The orchestration pipeline that sequences every other subsystem	M1, M4
R	Research loop	Saturation-gated search that controls what information reaches the model	M2
N	Notes	Persistent memory — observations compressed and stored after each run	M2
E	Evaluation loop	The Wiggum evaluate → revise → verify cycle that enforces quality	M3
S	Security	AST scanning, path sandboxing, and prompt injection detection	M4
S	Signals	Structured tracing and logging that make every run inspectable and improvable	M4, M5

The model is a commodity input. Given the same inputs, a stronger model produces better outputs, but the harness controls the inputs. In the experiments summarized above, the harness upgrade outweighed the producer-model upgrade, which is why a weaker model with excellent context can beat a stronger model with poor context.
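
One way to picture the Hooks subsystem is a registry keyed by the four pipeline points named in the table. The following is a minimal sketch, not the course's implementation: `HookRegistry`, `register`, `run`, and `add_count_constraint` are hypothetical names, and the pipeline context is assumed to be a plain dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# The four hook points named in the subsystem table.
HOOK_POINTS = ("pre_research", "pre_synthesis", "post_synthesis", "post_wiggum")

Context = Dict[str, object]
Hook = Callable[[Context], Context]


class HookRegistry:
    """Hypothetical registry mapping each pipeline point to an ordered list of hooks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = defaultdict(list)

    def register(self, point: str, hook: Hook) -> None:
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point}")
        self._hooks[point].append(hook)

    def run(self, point: str, context: Context) -> Context:
        # Apply hooks in registration order; each receives the previous result.
        for hook in self._hooks.get(point, []):
            context = hook(context)
        return context


def add_count_constraint(context: Context) -> Context:
    # Illustrative skill: attach an explicit item-count requirement to the task.
    return {**context, "required_items": 5}


registry = HookRegistry()
registry.register("pre_synthesis", add_count_constraint)
context = registry.run(
    "pre_synthesis", {"task": "find the top 5 context engineering techniques"}
)
print(context["required_items"])
```

Keeping the hook points in a fixed tuple means a typo in a point name fails at registration time rather than silently never running.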

The Design Principles

Seven principles govern every decision in the harness:

1. Build for deletion. Every workaround that compensates for a model limitation should be trivially removable when models improve. Don't bake brittleness in. If you write a post-processing step that strips malformed JSON from model output, mark it clearly as a temporary workaround, not a permanent fixture.

2. Verify externally at every stage boundary. Model self-report is not verification. If the model says it wrote a file, check with os.path.exists(). If it says there are 15 items in its output, count them with Python. If it says the output is 500 tokens, read the usage field in the response metadata. Never accept a model's account of its own behavior as ground truth.

3. Add observability before adding features. Structured traces before new tools. If you add a pipeline stage but can't measure its latency or token cost, you're flying blind. Every stage boundary should emit a trace span. Token accounting should happen at the infrastructure level, not as an afterthought.

4. Start with the simplest pattern that meets the requirement. Single agent before orchestration. Fixed search before saturation gating. Direct output before revision loops. Complexity has carrying costs — debug surface, latency, token cost. Add it only when the simpler version demonstrably fails.

5. Evaluator and producer must be different models. Same-model evaluation is circular. A model cannot reliably score its own output; it will exhibit the same systematic biases in both generation and evaluation. The evaluator should be from a different family or a significantly larger model than the producer.

6. The harness is the product; the model is a commodity input. Reliability, reproducibility, and correctness live in the harness, not in any particular model. When a model is deprecated or a better one is released, swapping it should be a configuration change, not a rewrite.

7. Telemetry is what separates a critic from a scorer. A scorer reads an output and returns a number. A critic reads an output and returns a number with reasons — logged at the stage level, timestamped, indexed by run ID. Structured per-dimension scores don't just confirm that a run scored 7.3 — they tell you which dimension pulled the score down, in which revision round, at what latency. Without that structure, you can detect that quality fell; you cannot identify where in the pipeline the fall originated or what to change next. Telemetry turns the evaluation loop from a gate into an instrument.
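
Principle 2 can be sketched in a few lines. The helpers below are hypothetical (`verify_file_written`, `count_numbered_items`, and `verify_item_count` are not from the course code), and the item counter assumes the model emits a numbered list with one item per line, which is an assumption of this example rather than a property of any particular model.

```python
import os
import re
import tempfile


def verify_file_written(path: str) -> bool:
    """Check the filesystem instead of trusting the model's claim."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def count_numbered_items(text: str) -> int:
    """Count lines starting with '1.', '2)', etc. (assumed output format)."""
    return len(re.findall(r"^[ \t]*\d+[.)][ \t]+\S", text, flags=re.MULTILINE))


def verify_item_count(text: str, expected: int) -> bool:
    return count_numbered_items(text) == expected


# Placeholder model output: the model claims five items but produced three.
model_output = "1. first item\n2. second item\n3. third item\n"
claimed_count = 5

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.md")
    missing = verify_file_written(path)   # nothing has been written yet
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(model_output)
    present = verify_file_written(path)   # now the file exists and is non-empty

count_ok = verify_item_count(model_output, claimed_count)
print(missing, present, count_ok)
```

The checks run in the harness, outside the model, so a confident but wrong self-report is caught at the stage boundary instead of propagating downstream.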

The Implication for Practice

If you are building an agentic system today, the highest-leverage thing you can do is:

1. Measure output quality rigorously (you cannot improve what you do not measure)
2. Identify which stage of the pipeline produces the most variance
3. Fix that stage first

Most teams skip step 1 and spend their improvement budget on model upgrades. This course gives you the tools to do all three.
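
Finding the highest-variance stage can be approximated from logged per-stage scores. The sketch below assumes each run logs one score per pipeline stage; the stage names and numbers are made-up placeholders for illustration, not experimental data.

```python
from statistics import pvariance

# Placeholder logs: one score per stage per run (illustrative values only).
stage_scores = {
    "research": [7.0, 7.5, 6.5, 7.0],
    "synthesis": [8.0, 4.0, 9.0, 5.0],
    "evaluation": [7.0, 7.0, 7.5, 7.5],
}

# Population variance of the logged scores for each stage.
variance_by_stage = {
    stage: pvariance(scores) for stage, scores in stage_scores.items()
}

# The stage with the largest spread is the first candidate for a fix.
noisiest_stage = max(variance_by_stage, key=variance_by_stage.get)
print(noisiest_stage, variance_by_stage[noisiest_stage])
```

With only a handful of runs per stage the variance estimate is itself noisy, so treat the ranking as a prompt for inspection rather than a verdict.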

A Note on Scope

The harness in this course is designed for knowledge synthesis tasks: research, summarization, analysis, document generation, literature review. The experiments cover tasks like "find the top 5 context engineering techniques" and "identify the most common failure modes in multi-agent systems."

For different task classes — code generation with execution, tool-augmented tasks, multi-turn conversation — some patterns transfer directly (verification, memory, observability) and some require adaptation (the rubric, the search loop). The principles are general; the specific implementation is tuned for synthesis.

