Harness Engineering for AI Agents · The Harness Thesis

Experimental Methodology

By the end of this reading you will be able to:
  • Design a one-factor completely randomized experiment to evaluate a harness change, specifying factor levels, replications, response variables, and controlled variables
  • Interpret a runs.jsonl entry and identify which fields correspond to response variables of interest
  • Explain why run order is randomized in harness experiments and what systematic biases randomization prevents

Why Experiments Matter

Most harness engineering happens by intuition: add a verification loop, run a few tasks, look at the output, judge it better or worse. The problem with this approach is that informal judgment is noisy. Model output has high variance — the same task on the same model can produce very different results on consecutive runs. Without a rigorous evaluation framework, you cannot distinguish a genuine harness improvement from a lucky sample.

The four experiments underpinning this course all use the same design: a one-factor completely randomized design (CRD). This is the simplest design that gives interpretable results while controlling for the major sources of confounding.

CRD Design Elements

A completely randomized design has four components:

Factor: The thing you are changing. In Experiment 1, the factor was task type (three levels: T_A context engineering, T_B cost management, T_C agent failure modes). In Experiment 2, the factor was harness configuration (baseline vs. upgraded). The factor is always one thing — one change at a time.

Levels: The specific values the factor takes. Experiment 1 had three levels (T_A, T_B, T_C). Experiment 2 had two levels (before/after). More levels give richer data but require more runs.

Replications: How many times each level is repeated. All four experiments used 3 replications per level. For the three-level Experiment 1 this gives 9 total runs (3 levels × 3 replications); a two-level experiment would have 6 (2 × 3). Three replications is a practical minimum for a usable variance estimate; five is better if time permits.

Randomized run order: The order in which runs are executed is randomized. In Experiment 1, the nine runs [A, A, A, B, B, B, C, C, C] were shuffled to produce a random order like [C, A, B, A, C, B, C, B, A]. This is not optional. If all runs of one level were executed back to back, anything that changes over time would be confounded with the factor. Randomization prevents systematic bias from search result drift (the same query on the same day returns similar results, so runs close together in time share search conditions) and model cache effects (warm models respond differently from cold ones).
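The shuffle described above can be sketched in a few lines. This is a minimal illustration, not the course's actual tooling: the function name build_run_order and the seed parameter are assumptions introduced here. Recording the seed alongside the results makes the run order reproducible.

```python
import random

def build_run_order(levels, replications, seed=None):
    """Return a shuffled list in which each level appears `replications` times.

    Hypothetical helper for illustration; not part of the course harness.
    """
    runs = [level for level in levels for _ in range(replications)]
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(runs)          # in-place uniform shuffle
    return runs

# Experiment 1 shape: three levels, three replications each
order = build_run_order(["A", "B", "C"], replications=3, seed=42)
print(order)
```

Using a dedicated random.Random instance rather than the module-level functions keeps the shuffle independent of any other randomness in the harness.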

Response Variables

Response variables are the outcomes you measure. For harness experiments the standard set is:

| Variable | Description | Direction |
| --- | --- | --- |
| output_bytes | Size of written file in bytes | Higher = richer |
| output_lines | Line count of written Markdown | Higher = more structured |
| first_wiggum_score | Evaluator score on round 1 | Higher = better first pass |
| wiggum_rounds | Number of revise cycles needed | Lower = more efficient |
| final | PASS / FAIL / ERROR | PASS = success |
| total_search_chars | Merged search result characters | Diagnostic |

The most informative single variable is first_wiggum_score: the quality of the output before any revision. A harness that gets it right the first time is better than one that needs two revision rounds to reach the same score, because each revision round costs tokens and latency.

Controlled Variables

Controlled variables are held constant across all runs in an experiment. This is what makes comparisons valid. Standard controlled variables:

| Variable | Experiment 1 value |
| --- | --- |
| Producer model | pi-qwen (qwen2.5:7b) |
| Evaluator model | glm4:9b |
| Wiggum max rounds | 3 |
| Wiggum pass threshold | score ≥ 8.0 |
| Searches per task | 2 |
| Search quality floor | 1,800 characters |

When you run Experiment 2 (harness upgrade impact), the controlled variables are the same — you are changing only the harness configuration. This is how you isolate the effect of the change.
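Before comparing levels, it can help to confirm that the controlled variables really were constant across the runs being compared. The sketch below assumes the controlled values are logged as fields on each run record. producer_model and evaluator_model appear in the runs.jsonl format, but the helper name check_controlled is hypothetical, and other controlled variables (max rounds, pass threshold) may not be logged at all.

```python
def check_controlled(runs, controlled_keys):
    """Return {key: set_of_values} for every controlled key that varies.

    Hypothetical helper: an empty result means every listed key held a
    single value across all runs.
    """
    violations = {}
    for key in controlled_keys:
        values = {run.get(key) for run in runs}
        if len(values) > 1:
            violations[key] = values
    return violations

# Illustrative records only: the third run used a different producer model
runs = [
    {"producer_model": "pi-qwen", "evaluator_model": "glm4:9b"},
    {"producer_model": "pi-qwen", "evaluator_model": "glm4:9b"},
    {"producer_model": "pi-qwen-32b", "evaluator_model": "glm4:9b"},
]
print(check_controlled(runs, ["producer_model", "evaluator_model"]))
```

A non-empty result means the comparison mixes more than one change, so any difference in the response variables cannot be attributed to the factor alone.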

The runs.jsonl Record

Every run appends a JSON object to runs.jsonl. A typical entry looks like:

{
  "run_id": "20260410_143201_context-engineering",
  "timestamp": "2026-04-10T14:32:01",
  "task": "Search for the top 5 context engineering techniques...",
  "task_type": "enumerated",
  "producer_model": "pi-qwen-32b",
  "evaluator_model": "Qwen3-Coder:30b",
  "tokens_by_stage": {
    "planning": 1240,
    "research_compression": 3820,
    "synthesis": 8450,
    "wiggum_r1": 2100,
    "wiggum_r2": 1980
  },
  "wiggum_scores": {
    "r1": {"relevance": 9, "completeness": 8, "depth": 7, "specificity": 7, "structure": 9},
    "r2": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 9}
  },
  "wiggum_rounds": 2,
  "final_score": 8.6,
  "final": "PASS",
  "output_path": "~/Desktop/context-engineering.md",
  "output_bytes": 4821,
  "output_lines": 142,
  "session_id": "sess_20260410"
}

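Most fields here correspond to a response variable or a controlled variable; the rest (run_id, timestamp, output_path, session_id) identify the run. The tokens_by_stage field enables cost analysis (how expensive was planning vs. synthesis?). The wiggum_scores field enables rubric analysis (which dimension is weakest?). Note that first_wiggum_score is not a top-level field in this entry. Here final_score (8.6) equals the unweighted mean of the five r2 dimension scores, which suggests the round-1 score can be derived the same way from r1.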
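The two analyses mentioned above can be sketched directly from the entry's nested fields. The record below copies only tokens_by_stage and wiggum_scores from the example entry; the helper names token_shares and weakest_dimensions are assumptions made for illustration, not functions from analytics.py.

```python
# Subset of the example runs.jsonl entry shown above
record = {
    "tokens_by_stage": {
        "planning": 1240, "research_compression": 3820,
        "synthesis": 8450, "wiggum_r1": 2100, "wiggum_r2": 1980,
    },
    "wiggum_scores": {
        "r1": {"relevance": 9, "completeness": 8, "depth": 7, "specificity": 7, "structure": 9},
        "r2": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 9},
    },
}

def token_shares(tokens_by_stage):
    """Fraction of the run's total tokens spent in each stage."""
    total = sum(tokens_by_stage.values())
    return {stage: count / total for stage, count in tokens_by_stage.items()}

def weakest_dimensions(round_scores):
    """All rubric dimensions tied for the lowest score in one round."""
    low = min(round_scores.values())
    return [dim for dim, score in round_scores.items() if score == low]

# Cost analysis: stages sorted from most to least expensive
shares = token_shares(record["tokens_by_stage"])
for stage, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:22s} {share:6.1%}")

# Rubric analysis: weakest dimension(s) per evaluation round
for round_name, scores in record["wiggum_scores"].items():
    print(round_name, "weakest:", weakest_dimensions(scores))
```

For this entry, synthesis accounts for roughly 48% of the 17,590 tokens, and depth and specificity tie as the weakest dimensions in both rounds.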

Analyzing Results

The analytics.py script reads runs.jsonl and computes summary statistics per task type and per experiment:

python analytics.py         # cross-run stats
python analytics.py --full  # per-run detail

For more targeted analysis, filter runs.jsonl by session or timestamp:

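import json

# Load one JSON object per line, skipping blank lines
with open('runs.jsonl') as f:
    runs = [json.loads(line) for line in f if line.strip()]

def first_score(run):
    # Use first_wiggum_score if logged; otherwise assume it is the unweighted
    # mean of the round-1 rubric scores (an assumption, not a documented rule)
    if 'first_wiggum_score' in run:
        return run['first_wiggum_score']
    r1 = run['wiggum_scores']['r1']
    return sum(r1.values()) / len(r1)

# Experiment 2: compare before/after harness upgrade
before = [r for r in runs if r['session_id'] == 'exp02_baseline']
after  = [r for r in runs if r['session_id'] == 'exp02_upgraded']

mean_before = sum(first_score(r) for r in before) / len(before)
mean_after  = sum(first_score(r) for r in after)  / len(after)
print(f'Before: {mean_before:.2f}  After: {mean_after:.2f}  Delta: {mean_after - mean_before:+.2f}')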
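A difference in means is easier to judge next to the spread within each group. The sketch below groups runs by a key and reports the count, mean, and sample standard deviation per group. The function summarize is hypothetical, the scores are made-up illustrative values rather than course results, and first_wiggum_score is assumed to be present on each record.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize(runs, group_key, value_key):
    """Per-group n, mean, and sample standard deviation of one variable."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[group_key]].append(run[value_key])
    return {
        name: {
            "n": len(values),
            "mean": mean(values),
            # stdev needs at least two data points
            "stdev": stdev(values) if len(values) > 1 else None,
        }
        for name, values in groups.items()
    }

# Made-up scores for illustration only
sample = [
    {"session_id": "exp02_baseline", "first_wiggum_score": 7.0},
    {"session_id": "exp02_baseline", "first_wiggum_score": 7.5},
    {"session_id": "exp02_baseline", "first_wiggum_score": 8.0},
    {"session_id": "exp02_upgraded", "first_wiggum_score": 8.0},
    {"session_id": "exp02_upgraded", "first_wiggum_score": 8.5},
    {"session_id": "exp02_upgraded", "first_wiggum_score": 9.0},
]
summary = summarize(sample, "session_id", "first_wiggum_score")
for name, stats in summary.items():
    print(name, stats)
```

With only three replications per level the standard deviation is itself a rough estimate, which is one reason five replications is preferable when time permits.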

The Keep Rule

For autoresearch experiments (Module 5), a keep rule governs whether a harness change is committed:

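Keep the change if composite_score_delta > 0.1. Otherwise, revert with git reset --hard HEAD~1 (a --soft reset only moves the branch pointer and leaves the change staged in the index and working tree, so later runs would still use it).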

The threshold is deliberately small. A 0.1-point improvement on a 0–10 scale is meaningful only because composite scores are averages over multiple tasks and replications, which damps run-to-run variance. Anything smaller is treated as noise.

For manual harness experiments, apply the same discipline: if the difference between the mean scores is smaller than one standard deviation of the replicate scores, do not treat it as a reliable improvement.
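One way to apply both checks together is sketched below. Combining the 0.1 threshold with the one-standard-deviation check in a single function is a design choice made here for illustration, as is using the baseline replicates' sample standard deviation as the noise estimate. The name keep_change and the scores are hypothetical.

```python
from statistics import mean, stdev

def keep_change(baseline_scores, candidate_scores, min_delta=0.1):
    """Decide whether a harness change clears both the threshold and the noise.

    Assumption: noise is estimated as the sample standard deviation of the
    baseline replicates (requires at least two baseline scores).
    """
    delta = mean(candidate_scores) - mean(baseline_scores)
    noise = stdev(baseline_scores)
    return {
        "delta": delta,
        "noise": noise,
        "keep": delta > min_delta and delta > noise,
    }

# Made-up scores for illustration only
clear_win = keep_change([7.0, 7.5, 8.0], [8.0, 8.5, 9.0])
marginal = keep_change([7.0, 7.5, 8.0], [7.2, 7.7, 8.2])
print(clear_win)
print(marginal)
```

In the second call the delta of about 0.2 exceeds the 0.1 threshold but is well inside the baseline spread of 0.5, so the change is not kept.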