Harness Engineering for AI Agents · The Harness Thesis

Model Roles & Separation

By the end of this reading you will be able to:
  • Explain the circular evaluation problem and why a model cannot reliably score its own output
  • Describe each model role in the harness (producer, evaluator, planner, vision) and the constraints on model selection for each
  • Configure the inference backend shim to route model calls to Ollama or vLLM using environment variables

The Circular Evaluation Problem

Design principle 5 states: evaluator and producer must be different models. This is not aesthetic preference — it is a correctness requirement.

When a model evaluates its own output, it reproduces the same reasoning and the same systematic biases it used to produce the output. If the model tends to overcount sections, it will also overcount when verifying them. If the model's training led it to prefer confident-sounding prose over concrete implementation detail, it will score confident-sounding prose highly regardless of whether implementation detail is present.

External evaluation by a model from a different family, or by a significantly larger model, breaks this circularity. Because the evaluator does not share the producer's systematic biases to the same degree, its judgment is more independent, and its revision feedback is more likely to identify actual problems than to rationalize existing choices.

This is the same reason academic peer review uses external reviewers, and the same reason code review is valuable even when the author is competent.
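The separation rule can be sketched as a small pre-flight check. This is an illustrative sketch, not part of the harness: `ModelInfo`, `separation_problems`, and the `min_size_ratio` threshold (a stand-in for "significantly larger") are all hypothetical names and values, and family labels are supplied by hand rather than inferred from model tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str        # model tag as the harness would refer to it
    family: str      # e.g. "qwen", "glm", "llama" (supplied by hand)
    params_b: float  # parameter count in billions


def separation_problems(producer, evaluator, min_size_ratio=2.0):
    """Return reasons this pairing risks circular evaluation (empty list = OK).

    min_size_ratio is an assumed threshold for "significantly larger".
    """
    if producer.name == evaluator.name:
        return ["evaluator and producer are the same model"]
    same_family = producer.family == evaluator.family
    much_larger = evaluator.params_b >= min_size_ratio * producer.params_b
    if same_family and not much_larger:
        return ["evaluator shares the producer's family and is not significantly larger"]
    return []
```

Running a check like this at startup would turn the design principle into a failure that surfaces before any scoring happens, rather than a convention that silently erodes when someone overrides a model.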

Model Roles

The harness assigns different models to different roles based on their required capabilities and the latency budget for each stage:

| Role | Model | Why |
| --- | --- | --- |
| Producer (default) | pi-qwen-32b (Qwen2.5-32B Q4_K_M) | Largest available model for maximum depth; custom Modelfile with optimized system prompt |
| Producer (fallback) | pi-qwen (qwen2.5:7b) | ~3× faster; use on 16 GB RAM systems or for fast iteration |
| Evaluator | Qwen3-Coder:30b | Must be larger than producer or from a different family; coding background improves specificity scoring |
| Planner / Compressor | glm4:9b | Fast enough for per-round compression; different architecture from producer prevents prompt echo |
| Annotator | nanda-annotator | Qwen2.5-7B fine-tuned with QLoRA on domain-specific annotations; produced in Module 5 |
| Vision | llama3.2-vision | Image-to-text preprocessing only; does not synthesize or evaluate |
| GitHub skill | llama3.2:3b | Commit message and PR generation only; fast, low-stakes |

The Custom Modelfile

Ollama Modelfiles define a model's system prompt, temperature, context window, and other parameters. The producer model uses a custom Modelfile that sets:

FROM qwen2.5:32b-instruct-q4_K_M

SYSTEM """
You are a research and synthesis assistant. Your task is to produce comprehensive,
well-structured Markdown documents based on the provided research context.

Always:
- Begin output with a top-level Markdown heading (# Title)
- Use concrete examples, named tools, and specific implementation steps
- Provide implementation notes for each technique or concept covered
- Output ONLY valid Markdown — no preamble, no commentary about your approach
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 32768
PARAMETER num_predict 4096

The system prompt, not the per-request user prompt, is where synthesis instruction engineering lives. This keeps the per-request prompt clean and makes the system instruction the sole target of autoresearch optimization.
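A Modelfile only takes effect once it is registered with Ollama under a model name. The sketch below wraps the `ollama create <name> -f <file>` CLI call in Python. The helper names and the `Modelfile` path are assumptions for illustration; the reading does not say how or where the harness stores or registers its Modelfile.

```python
import shutil
import subprocess
from pathlib import Path


def create_command(model_name: str, modelfile: Path) -> list[str]:
    """Build the `ollama create` invocation for a given Modelfile."""
    return ["ollama", "create", model_name, "-f", str(modelfile)]


def register_model(model_name: str, modelfile: Path) -> None:
    """Register the Modelfile with the local Ollama install under model_name."""
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama CLI not found on PATH")
    # check=True raises CalledProcessError if Ollama rejects the Modelfile.
    subprocess.run(create_command(model_name, modelfile), check=True)


# Example (assumed file location):
# register_model("pi-qwen-32b", Path("Modelfile"))
```

After registration, the name passed to `ollama create` is the name the harness uses in its model calls.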

The Inference Backend Shim

inference.py provides a unified interface that routes model calls to either Ollama (default) or vLLM, using the same call signature throughout the codebase:

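from inference import OllamaLike

prompt = "Summarize the research context."  # placeholder; any user prompt string

client = OllamaLike()  # reads INFERENCE_BACKEND env var
response = client.chat(
    model="pi-qwen-32b",  # logical name; remapped when the backend is vLLM
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.3, "num_predict": 4096},
)
content = response["message"]["content"]     # generated text
tokens  = response["usage"]["total_tokens"]  # token count as reported by the shim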

When INFERENCE_BACKEND=vllm, the shim remaps model names via VLLM_MODEL_MAP and translates vLLM's OpenAI-compatible response into the Ollama-style format that the rest of the harness expects. The harness code never checks which backend is running.
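The translation step might look roughly like the following. This is a sketch under assumptions, not the contents of inference.py: `to_openai_params` and `to_harness_shape` are hypothetical helper names, and the output shape is inferred only from the two fields the harness reads in the call example above (`message.content` and `usage.total_tokens`). The input shape is the standard OpenAI-style chat completion body that vLLM's server returns.

```python
# Ollama option names mapped to their OpenAI-style request equivalents.
# Options with no entry here (e.g. num_ctx) are dropped rather than forwarded.
OPTION_MAP = {
    "temperature": "temperature",
    "top_p": "top_p",
    "num_predict": "max_tokens",
}


def to_openai_params(options: dict) -> dict:
    """Translate Ollama-style `options` into OpenAI-style request parameters."""
    return {OPTION_MAP[k]: v for k, v in options.items() if k in OPTION_MAP}


def to_harness_shape(openai_response: dict) -> dict:
    """Reshape an OpenAI-style chat completion into the dict the harness reads."""
    message = openai_response["choices"][0]["message"]
    usage = openai_response.get("usage") or {}
    return {
        "message": {
            "role": message.get("role", "assistant"),
            "content": message.get("content") or "",
        },
        "usage": {"total_tokens": usage.get("total_tokens", 0)},
    }
```

Keeping both translations as pure functions makes the shim easy to test without a running server.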

vLLM Setup (GPU-Limited Systems)

For systems where the 32B model is too large for Ollama (it requires ~20 GB of unified RAM+VRAM), serve a smaller model through vLLM instead. The command below serves Qwen2.5-14B-Instruct with AWQ int4 quantization, which reduces the memory footprint to ~9.4 GB:

# In WSL2 Ubuntu:
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --dtype half --quantization awq_marlin \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90

Then in .env:

INFERENCE_BACKEND=vllm
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_MODEL_MAP={"pi-qwen-32b":"Qwen/Qwen2.5-14B-Instruct-AWQ"}

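The model map decouples logical role names from physical models: pi-qwen-32b in the harness code becomes Qwen/Qwen2.5-14B-Instruct-AWQ on the vLLM server, and several logical names can point at the same served model. The logical name is only a label here; the model actually served is the 14B one. The harness never sees the underlying model name.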
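Reading the map is a matter of parsing the JSON value of VLLM_MODEL_MAP and falling back to the logical name when no entry exists. The following is a minimal sketch with hypothetical helper names (`load_model_map`, `resolve_model`); the actual lookup and fallback behavior inside inference.py is not shown in this reading.

```python
import json
import os


def load_model_map(env=None) -> dict:
    """Parse VLLM_MODEL_MAP (a JSON object) into a logical-to-served-name dict."""
    env = os.environ if env is None else env
    raw = (env.get("VLLM_MODEL_MAP") or "").strip()
    if not raw:
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("VLLM_MODEL_MAP must be a JSON object")
    return {str(k): str(v) for k, v in parsed.items()}


def resolve_model(logical_name: str, model_map: dict) -> str:
    """Return the served model for a logical name, or the name itself if unmapped."""
    return model_map.get(logical_name, logical_name)
```

Passing unmapped names through unchanged is one possible design choice; a stricter shim could raise instead, so that a typo in a role name fails loudly rather than reaching the server.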

Configuration via Environment Variables

All model role assignments are configurable without code changes:

# Override producer
export HARNESS_PRODUCER_MODEL=mistral-small3.1:24b

# Override evaluator
export WIGGUM_EVALUATOR_MODEL=phi4:14b

# Override planner/compressor
export PLANNER_MODEL=qwen2.5:7b
export COMPRESS_MODEL=qwen2.5:7b   # can differ from planner

# Switch inference backend
export INFERENCE_BACKEND=vllm

This is what "build for deletion" looks like in practice: any model in any role can be replaced with a single environment variable change. The harness does not hardcode model names in business logic.
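One way to keep model names out of business logic is to resolve every role in a single place at startup. The sketch below assumes the defaults from the role table and the variable names from the overrides above; `ROLE_ENV_DEFAULTS` and `resolve_roles` are hypothetical names, and whether the harness resolves roles this way is an assumption.

```python
import os

# role -> (environment variable, default model from the role table)
ROLE_ENV_DEFAULTS = {
    "producer": ("HARNESS_PRODUCER_MODEL", "pi-qwen-32b"),
    "evaluator": ("WIGGUM_EVALUATOR_MODEL", "Qwen3-Coder:30b"),
    "planner": ("PLANNER_MODEL", "glm4:9b"),
    "compressor": ("COMPRESS_MODEL", "glm4:9b"),
}


def resolve_roles(env=None) -> dict:
    """Resolve each role to a model name, preferring environment overrides."""
    env = os.environ if env is None else env
    roles = {}
    for role, (var, default) in ROLE_ENV_DEFAULTS.items():
        # Treat unset, empty, or whitespace-only values as "no override".
        value = (env.get(var) or "").strip()
        roles[role] = value or default
    return roles
```

Call sites then ask for `roles["evaluator"]` rather than naming a model, so swapping a model never touches the code that uses it.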