Harness Engineering for AI Agents · The Harness Thesis

Before You Start: Prerequisites & Learning Path

8 min read
By the end of this reading you will be able to:
  • Identify the technical prerequisites for this course and assess your own readiness
  • Describe the five-module arc of the course and what you will be able to build by the end

What This Course Is About

This course teaches you to build reliable, production-grade agentic AI systems using open-source models running locally: no API keys, no per-token billing, no black-box model lock-in.

The central thesis is simple but counterintuitive: the harness matters more than the model. Swapping from a 7B to a 32B producer shifts quality by 10–15%. Fixing a fundamental harness flaw — missing verification, no memory, naive search — can shift quality by 50–80%. The four randomized experiments that underpin this course demonstrate this empirically.

By the end of this course you will be able to:

  • Design and run a rigorous experiment to evaluate harness changes
  • Build a saturation-gated research loop that stops searching when new results stop adding information
  • Implement an evaluate → revise → verify loop with a decimalized rubric
  • Wire a custom skill into the pipeline at any of four hook points
  • Instrument any pipeline stage with structured tracing
  • Build a DPO preference dataset from your own run logs and fine-tune a domain-specific model
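
To make the idea of a saturation gate concrete, here is a minimal sketch. It is not the harness's implementation: the function names, the novelty measure (the share of words not seen in earlier results), and the threshold and patience values are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable


def novelty(seen: set[str], text: str) -> float:
    """Fraction of words in `text` not seen in earlier results (hypothetical measure)."""
    words = set(text.lower().split())
    if not words:
        return 0.0
    return len(words - seen) / len(words)


def saturation_gated_search(
    queries: Iterable[str],
    search: Callable[[str], str],
    threshold: float = 0.2,  # assumed cutoff, not a harness default
    patience: int = 2,       # assumed: stop after this many low-novelty rounds in a row
) -> list[str]:
    """Run queries in order; stop once results stop adding new information."""
    seen: set[str] = set()
    kept: list[str] = []
    low_rounds = 0
    for query in queries:
        result = search(query)
        score = novelty(seen, result)
        seen |= set(result.lower().split())
        if score < threshold:
            low_rounds += 1
            if low_rounds >= patience:
                break  # saturated: recent rounds added almost nothing new
        else:
            low_rounds = 0
            kept.append(result)
    return kept


# Toy stand-in for a real search backend (hypothetical data).
fake_results = {
    "q1": "agents need verification loops",
    "q2": "memory systems store compressed notes",
    "q3": "agents need verification loops",
    "q4": "verification loops agents need",
    "q5": "entirely fresh material never reached",
}
kept = saturation_gated_search(fake_results, fake_results.__getitem__)
print(len(kept))
```

In this toy run the third and fourth queries return nothing new, so the loop stops before issuing the fifth. A real gate would likely use a stronger novelty signal than word overlap, such as embedding distance.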

Prerequisites

Required:

  • Python fluency — you should be comfortable reading and modifying 500–1,000 line Python scripts. The harness is not a library you install; it is code you read, understand, and adapt.
  • Basic ML literacy — you should know what a language model is, what fine-tuning means at a high level, and what a context window is. You do not need deep ML theory.
  • Command-line comfort — installing Ollama models, running Python scripts, reading log output. The course runs everything locally; there is no managed infrastructure.

Helpful but not required:

  • Experience with any LLM API (OpenAI, Anthropic, etc.) — the mental model transfers even though the infrastructure differs
  • Familiarity with SQLite and/or vector databases — memory.py uses both, and knowing what they are makes the Memory Systems reading click faster
  • Basic familiarity with PyTorch or transformers — useful for the QLoRA fine-tuning reading in Module 5

Not required:

  • Frontier model API access (the course runs on Ollama with local models)
  • A GPU (though one speeds up fine-tuning in Module 5 significantly)
  • Deep knowledge of transformer architecture

Hardware

The harness is designed to run on commodity hardware. Practical minimums:

Component | Minimum | Recommended
RAM | 16 GB | 32 GB
VRAM | 0 GB (CPU fallback) | 8 GB+
Storage | 30 GB free | 60 GB+
OS | Windows/macOS/Linux | Any

The default producer model (pi-qwen-32b, Qwen2.5-32B Q4_K_M) requires ~20 GB of RAM+VRAM. On a 16 GB machine you can use the 7B fallback (pi-qwen) — quality is lower but the pipeline logic is identical.
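
The ~20 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes about 32.5 billion parameters for Qwen2.5-32B, about 7.6 billion for the 7B model, and roughly 4.85 bits per weight for Q4_K_M. All three are approximations, and the estimate covers weights only, not the KV cache or runtime overhead. The function names and the selection rule are hypothetical, not part of the harness.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def pick_producer(memory_gb: float, needed_gb: float = 20.0) -> str:
    """Hypothetical rule: use the 32B producer only if it fits in RAM+VRAM."""
    return "pi-qwen-32b" if memory_gb >= needed_gb else "pi-qwen"


Q4_K_M_BITS = 4.85  # assumed average bits per weight for Q4_K_M

size_32b = weights_gb(32.5, Q4_K_M_BITS)
size_7b = weights_gb(7.6, Q4_K_M_BITS)
print(f"32B weights: ~{size_32b:.1f} GB, 7B weights: ~{size_7b:.1f} GB")
print(pick_producer(16), pick_producer(32))
```

The weights alone come close to 20 GB, which is why a 16 GB machine falls back to the 7B producer.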


Software Setup

Clone the harness repository:

git clone https://github.com/nickmccarty/ollama-pi-harness
cd ollama-pi-harness

Install Ollama and pull the required models:

# Install Ollama first (https://ollama.com)
ollama pull qwen2.5:7b            # base producer (fallback)
ollama pull glm4:9b               # planner + memory compression
ollama pull qwen3-coder:30b       # evaluator
ollama pull llama3.2-vision       # vision preprocessing

# Create custom producer Modelfiles
ollama create pi-qwen -f Modelfile
ollama create pi-qwen-32b -f Modelfile.32b   # if 32B available
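
Once the pulls finish, a quick check can confirm that nothing is missing. The sketch below is an illustration, not a script from the repository. It assumes an Ollama server on the default port 11434 and uses its `/api/tags` endpoint to list local models; the helper names are hypothetical, and the comparison assumes model names can be matched case-insensitively. The optional pi-qwen-32b is left out of the required list.

```python
import json
import urllib.request

# Model names taken from the pull/create commands above.
REQUIRED = ["qwen2.5:7b", "glm4:9b", "Qwen3-Coder:30b", "llama3.2-vision", "pi-qwen"]


def normalize(name: str) -> str:
    """Lowercase and add Ollama's default ':latest' tag when none is given."""
    name = name.strip().lower()
    return name if ":" in name else f"{name}:latest"


def missing_models(required: list[str], installed: list[str]) -> list[str]:
    """Return the required models that do not appear in the installed list."""
    have = {normalize(n) for n in installed}
    return [m for m in required if normalize(m) not in have]


def installed_models(host: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server for its local models via GET /api/tags."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


# Offline demonstration with a hard-coded list (hypothetical machine state).
sample = ["qwen2.5:7b", "glm4:9b", "llama3.2-vision:latest"]
print(missing_models(REQUIRED, sample))
```

With the server running, `missing_models(REQUIRED, installed_models())` performs the same check against your actual machine.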

Python environment:

conda create -n ollama-pi python=3.11
conda activate ollama-pi
pip install ollama ddgs "markitdown[all]" chromadb sentence-transformers datasets
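
A short import check can confirm the environment is complete before you run anything. This is a sketch rather than part of the harness; the mapping from pip package name to import name is stated explicitly because the two sometimes differ (for example, sentence-transformers imports as sentence_transformers), and the helper name is hypothetical.

```python
from importlib.util import find_spec

# pip package name -> top-level module name for the packages installed above.
MODULES = {
    "ollama": "ollama",
    "ddgs": "ddgs",
    "markitdown[all]": "markitdown",
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "datasets": "datasets",
}


def missing_packages(modules: dict[str, str]) -> list[str]:
    """Return pip names whose top-level module cannot be found in this environment."""
    return [pip_name for pip_name, module in modules.items() if find_spec(module) is None]


missing = missing_packages(MODULES)
print("All packages importable" if not missing else f"Missing: {missing}")
```

Run it inside the activated ollama-pi environment. `find_spec` only locates each module without importing it, so the check stays fast even for heavy packages.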

Module Arc

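The harness has seven subsystems (Hooks, Agent, Research loop, Notes, Evaluation loop, Security, Signals) whose initials spell HARNESS. This course covers all of them; the HARNESS column below lists the subsystem initials each module addresses: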

Module | Title | Core Question | HARNESS
M1 | The Harness Thesis | What is harness engineering, and why does it dominate quality? | A
M2 | Context Engineering & Memory | What information reaches the model, and how is it selected? | R, N
M3 | Verification & Failure Modes | How do we know the output is good, and what goes wrong? | E
M4 | Production Systems | How do we extend, orchestrate, secure, and observe the harness? | H, S, S
M5 | Self-Improvement | How does the harness improve itself over time? | all

Modules build on each other — M3 assumes you understand M2's research pipeline, and M5's autoresearch loop only makes sense once you can read a wiggum score. Work through them in order.