Harness Engineering for AI Agents · The Harness Thesis

Before You Start: Prerequisites & Learning Path

8 min read

By the end of this reading you will be able to:

Identify the technical prerequisites for this course and assess your own readiness
Describe the five-module arc of the course and what you will be able to build by the end

What This Course Is About

This course teaches you to build reliable, production-grade agentic AI systems using open-source models running locally: no API keys, no per-token billing, and no black-box model lock-in.

The central thesis is simple but counterintuitive: the harness matters more than the model. Swapping from a 7B to a 32B producer shifts quality by 10–15%. Fixing a fundamental harness flaw — missing verification, no memory, naive search — can shift quality by 50–80%. The four randomized experiments that underpin this course demonstrate this empirically.

By the end of this course you will be able to:

Design and run a rigorous experiment to evaluate harness changes
Build a saturation-gated research loop that stops searching when new results stop adding information
Implement an evaluate → revise → verify loop with a decimalized rubric
Wire a custom skill into the pipeline at any of four hook points
Instrument any pipeline stage with structured tracing
Build a DPO preference dataset from your own run logs and fine-tune a domain-specific model
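
To make "saturation-gated" concrete, here is a minimal sketch of the idea, not the harness's actual implementation. It assumes novelty can be approximated by the fraction of previously unseen words in each result; the function names, the `min_novelty` threshold, and the `patience` counter are all hypothetical choices made for illustration.

```python
def novelty(seen: set[str], result: str) -> float:
    """Fraction of the result's distinct words not seen in earlier results."""
    words = set(result.lower().split())
    if not words:
        return 0.0
    return len(words - seen) / len(words)


def saturated_search(batches, min_novelty=0.2, patience=2):
    """Consume batches of search results until they stop adding information.

    Stops after `patience` consecutive batches whose mean novelty falls
    below `min_novelty`, and returns every result consumed up to that point.
    (Hypothetical gate; the thresholds are illustrative, not tuned.)
    """
    seen: set[str] = set()
    kept: list[str] = []
    low_streak = 0
    for batch in batches:
        # Score each result against what was known before this batch.
        scores = [novelty(seen, r) for r in batch]
        mean = sum(scores) / len(scores) if scores else 0.0
        kept.extend(batch)
        for r in batch:
            seen.update(r.lower().split())
        # Count consecutive low-novelty batches; reset on any informative one.
        low_streak = low_streak + 1 if mean < min_novelty else 0
        if low_streak >= patience:
            break
    return kept
```

A real gate would likely compare embeddings rather than raw words, but the control flow is the same: measure how much each round adds, and stop once several rounds in a row add almost nothing.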

Prerequisites

Required:

Python fluency — you should be comfortable reading and modifying 500–1,000 line Python scripts. The harness is not a library you install; it is code you read, understand, and adapt.
Basic ML literacy — you should know what a language model is, what fine-tuning means at a high level, and what a context window is. You do not need deep ML theory.
Command-line comfort — installing Ollama models, running Python scripts, reading log output. The course runs everything locally; there is no managed infrastructure.

Helpful but not required:

Experience with any LLM API (OpenAI, Anthropic, etc.) — the mental model transfers even though the infrastructure differs
Familiarity with SQLite and/or vector databases — memory.py uses both, and knowing what they are makes the Memory Systems reading click faster
Basic familiarity with PyTorch or transformers — useful for the QLoRA fine-tuning reading in Module 5

Not required:

Frontier model API access (the course runs on Ollama with local models)
A GPU (though one speeds up fine-tuning in Module 5 significantly)
Deep knowledge of transformer architecture

Hardware

The harness is designed to run on commodity hardware. Practical minimums:

Component	Minimum	Recommended
RAM	16 GB	32 GB
VRAM	0 GB (CPU fallback)	8 GB+
Storage	30 GB free	60 GB+
OS	Windows/macOS/Linux	Any

The default producer model (pi-qwen-32b, Qwen2.5-32B Q4_K_M) requires ~20 GB of RAM+VRAM. On a 16 GB machine you can use the 7B fallback (pi-qwen) — quality is lower but the pipeline logic is identical.
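
The ~20 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes roughly 5 bits per weight for a 4-bit K-quant (a rule of thumb, not a measured value) and ignores context-cache overhead; both helper functions are hypothetical and not part of the harness. The model names and the 20 GB threshold come from the paragraph above.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 5.0) -> float:
    """Approximate memory for quantized weights, in GB (10^9 bytes).

    bits_per_weight=5.0 is an assumed rule of thumb for 4-bit K-quants.
    """
    return params_billions * bits_per_weight / 8


def pick_producer(available_gb: float, required_gb: float = 20.0) -> str:
    """Return the producer model name that fits the RAM+VRAM budget."""
    return "pi-qwen-32b" if available_gb >= required_gb else "pi-qwen"


print(weight_memory_gb(32))   # 20.0
print(pick_producer(16))      # pi-qwen
print(pick_producer(32))      # pi-qwen-32b
```

Treat `available_gb` as memory actually free for the model, not the machine's total: the operating system and the other pipeline models need room too.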

Software Setup

Clone the harness repository:

git clone https://github.com/nickmccarty/ollama-pi-harness
cd ollama-pi-harness

Install Ollama and pull the required models:

# Install Ollama (ollama.ai)
ollama pull qwen2.5:7b            # base producer (fallback)
ollama pull glm4:9b               # planner + memory compression
ollama pull Qwen3-Coder:30b       # evaluator
ollama pull llama3.2-vision       # vision preprocessing

# Create custom producer Modelfiles
ollama create pi-qwen -f Modelfile
ollama create pi-qwen-32b -f Modelfile.32b   # if 32B available

Python environment:

conda create -n ollama-pi python=3.11
conda activate ollama-pi
pip install ollama ddgs "markitdown[all]" chromadb sentence-transformers datasets
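
Before moving on, it can help to confirm that every pulled model is actually present. The following is a small sketch, assuming the `ollama list` command prints a header row followed by one model per row with the name in the first column; the helper names are hypothetical and not part of the harness. Names are compared case-insensitively, and a name without a tag is treated as `:latest`.

```python
import subprocess

# Models pulled in the setup commands above.
REQUIRED = ["qwen2.5:7b", "glm4:9b", "Qwen3-Coder:30b", "llama3.2-vision"]


def normalize(name: str) -> str:
    """Lowercase a model name and add the implicit ':latest' tag if none is given."""
    name = name.strip().lower()
    return name if ":" in name else name + ":latest"


def installed_models(list_output: str) -> set[str]:
    """Parse the text printed by `ollama list` into a set of normalized names."""
    rows = list_output.strip().splitlines()[1:]  # skip the header row
    return {normalize(row.split()[0]) for row in rows if row.split()}


def missing_models(list_output: str, required=REQUIRED) -> list[str]:
    """Return the required models that do not appear in the listing."""
    have = installed_models(list_output)
    return [m for m in required if normalize(m) not in have]


def check_setup() -> None:
    """Run `ollama list` and report any required model that is missing."""
    out = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    missing = missing_models(out)
    print("All models present." if not missing else f"Missing: {', '.join(missing)}")
```

Call `check_setup()` from a Python prompt inside the `ollama-pi` environment once the pulls have finished. It raises an error if the `ollama` command is not installed or exits with a failure.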

Module Arc

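The harness has seven subsystems whose initials spell HARNESS: Hooks, Agent, Research loop, Notes, Evaluation loop, Security, and Signals. This course covers all of them; the last column of the table lists the subsystems each module addresses by initial: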

Module	Title	Core Question	HARNESS
M1	The Harness Thesis	What is harness engineering, and why does it dominate quality?	A
M2	Context Engineering & Memory	What information reaches the model, and how is it selected?	R, N
M3	Verification & Failure Modes	How do we know the output is good, and what goes wrong?	E
M4	Production Systems	How do we extend, orchestrate, secure, and observe the harness?	H, S, S
M5	Self-Improvement	How does the harness improve itself over time?	all

Modules build on each other — M3 assumes you understand M2's research pipeline, and M5's autoresearch loop only makes sense once you can read a wiggum score. Work through them in order.

References

ollama-pi-harness: the agentic harness this course is built on

Ollama: run large language models locally
