Experimental Methodology
- Design a one-factor completely randomized experiment to evaluate a harness change, specifying factor levels, replications, response variables, and controlled variables
- Interpret a runs.jsonl entry and identify which fields correspond to response variables of interest
- Explain why run order is randomized in harness experiments and what systematic biases randomization prevents
Why Experiments Matter
Most harness engineering happens by intuition: add a verification loop, run a few tasks, look at the output, judge it better or worse. The problem with this approach is that informal judgment is noisy. Model output has high variance — the same task on the same model can produce very different results on consecutive runs. Without a rigorous evaluation framework, you cannot distinguish a genuine harness improvement from a lucky sample.
The four experiments underpinning this course all use the same design: a one-factor completely randomized design (CRD). This is the simplest design that gives interpretable results while controlling for the major sources of confounding.
CRD Design Elements
A completely randomized design has four components:
Factor: The thing you are changing. In Experiment 1, the factor was task type (three levels: T_A context engineering, T_B cost management, T_C agent failure modes). In Experiment 2, the factor was harness configuration (baseline vs. upgraded). The factor is always one thing — one change at a time.
Levels: The specific values the factor takes. Experiment 1 had three levels (T_A, T_B, T_C). Experiment 2 had two levels (before/after). More levels give richer data but require more runs.
Replications: How many times each level is repeated. All four experiments used 3 replications per level; in Experiment 1 this gives 9 total runs (3 levels × 3 replications). Three replications is a practical minimum for estimating variance; five is better if time permits.
Randomized run order: The order in which runs are executed is randomized. In Experiment 1, the nine runs [A, A, A, B, B, B, C, C, C] were shuffled to produce a random order like [C, A, B, A, C, B, C, B, A]. This is not optional. If all runs of one level executed back to back, any time-dependent effect would be confounded with the factor. Randomization guards against two such effects: search result drift (the same query returns similar results on the same day, but results shift over time) and model cache effects (warm models respond differently from cold ones).
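The shuffle described above can be sketched in a few lines. This is a minimal illustration rather than the course's actual tooling: the function name make_run_order and its seed parameter are assumptions introduced here. Passing a fixed seed makes the order reproducible, so it can be recorded alongside the results.

```python
import random
from collections import Counter


def make_run_order(levels, replications, seed=None):
    """Return a shuffled run order with `replications` runs per level.

    `make_run_order` and `seed` are hypothetical names used for illustration.
    """
    runs = [level for level in levels for _ in range(replications)]
    rng = random.Random(seed)  # private generator; a fixed seed reproduces the order
    rng.shuffle(runs)          # shuffles the list in place
    return runs


order = make_run_order(["A", "B", "C"], replications=3, seed=42)
print(order)           # nine labels in randomized order
print(Counter(order))  # each level still appears exactly three times
```

Shuffling the full list of runs, rather than shuffling the levels and repeating them in blocks, is what makes the design completely randomized: every ordering of the nine runs is equally likely.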
Response Variables
Response variables are the outcomes you measure. For harness experiments the standard set is:
| Variable | Description | Direction |
|---|---|---|
| output_bytes | Size of written file in bytes | Higher = richer |
| output_lines | Line count of written Markdown | Higher = more structured |
| first_wiggum_score | Evaluator score on round 1 | Higher = better first pass |
| wiggum_rounds | Number of revise cycles needed | Lower = more efficient |
| final | PASS / FAIL / ERROR | PASS = success |
| total_search_chars | Merged search result characters | Diagnostic |
The most informative single variable is first_wiggum_score — the quality of the output before any revision. A harness that gets it right first time is better than one that needs two revision rounds to reach the same score, because revision rounds cost tokens and latency.
Controlled Variables
Controlled variables are held constant across all runs in an experiment. This is what makes comparisons valid. Standard controlled variables:
| Variable | Experiment 1 value |
|---|---|
| Producer model | pi-qwen (qwen2.5:7b) |
| Evaluator model | glm4:9b |
| Wiggum max rounds | 3 |
| Wiggum pass threshold | score ≥ 8.0 |
| Searches per task | 2 |
| Search quality floor | 1,800 characters |
When you run Experiment 2 (harness upgrade impact), the controlled variables are the same — you are changing only the harness configuration. This is how you isolate the effect of the change.
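Before comparing runs, it can help to confirm that the controlled variables really were constant. The following is a sketch under assumptions: it checks only producer_model and evaluator_model, which are the controlled variables recorded in each runs.jsonl entry, and the helper name varying_controlled_fields is hypothetical.

```python
CONTROLLED_FIELDS = ("producer_model", "evaluator_model")


def varying_controlled_fields(runs, fields=CONTROLLED_FIELDS):
    """Return {field: set_of_values} for each controlled field that is not constant."""
    drift = {}
    for field in fields:
        values = {run.get(field) for run in runs}
        if len(values) > 1:
            drift[field] = values
    return drift


# Hypothetical records: the second run used a different evaluator by mistake.
runs = [
    {"producer_model": "pi-qwen", "evaluator_model": "glm4:9b"},
    {"producer_model": "pi-qwen", "evaluator_model": "other-evaluator"},
]
print(varying_controlled_fields(runs))  # reports evaluator_model with both values
```

An empty result means the checked fields were held constant; anything else means the comparison mixes more than one change.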
The runs.jsonl Record
Every run appends a JSON object to runs.jsonl. A typical entry looks like:
```json
{
  "run_id": "20260410_143201_context-engineering",
  "timestamp": "2026-04-10T14:32:01",
  "task": "Search for the top 5 context engineering techniques...",
  "task_type": "enumerated",
  "producer_model": "pi-qwen-32b",
  "evaluator_model": "Qwen3-Coder:30b",
  "tokens_by_stage": {
    "planning": 1240,
    "research_compression": 3820,
    "synthesis": 8450,
    "wiggum_r1": 2100,
    "wiggum_r2": 1980
  },
  "wiggum_scores": {
    "r1": {"relevance": 9, "completeness": 8, "depth": 7, "specificity": 7, "structure": 9},
    "r2": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 9}
  },
  "wiggum_rounds": 2,
  "final_score": 8.6,
  "final": "PASS",
  "output_path": "~/Desktop/context-engineering.md",
  "output_bytes": 4821,
  "output_lines": 142,
  "session_id": "sess_20260410"
}
```
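Most fields here correspond to a response variable (for example wiggum_rounds, final, output_bytes, output_lines) or a controlled variable (producer_model, evaluator_model); the rest, such as run_id, timestamp, and session_id, identify the run. The tokens_by_stage field enables cost analysis (how expensive was planning vs. synthesis?). The wiggum_scores field enables rubric analysis (which dimension is weakest?).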
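Both kinds of analysis can be sketched directly on the sample entry. One assumption is labeled in the code: a round's composite score is taken to be the unweighted mean of its rubric dimensions. That reading is consistent with the sample, where the r2 dimensions average 8.6 and final_score is 8.6, but the source does not state the formula.

```python
from statistics import mean

# Trimmed copy of the sample entry; only the fields used here.
entry = {
    "tokens_by_stage": {
        "planning": 1240, "research_compression": 3820,
        "synthesis": 8450, "wiggum_r1": 2100, "wiggum_r2": 1980,
    },
    "wiggum_scores": {
        "r1": {"relevance": 9, "completeness": 8, "depth": 7, "specificity": 7, "structure": 9},
        "r2": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 9},
    },
}

# Assumption: composite score per round = unweighted mean of the rubric dimensions.
round_means = {rnd: mean(dims.values()) for rnd, dims in entry["wiggum_scores"].items()}

# Rubric analysis: every dimension tied for the lowest score in round 1.
r1 = entry["wiggum_scores"]["r1"]
weakest = sorted(dim for dim, score in r1.items() if score == min(r1.values()))

# Cost analysis: each stage's share of the total token count.
total = sum(entry["tokens_by_stage"].values())
shares = {stage: n / total for stage, n in entry["tokens_by_stage"].items()}

print("round means:", round_means)
print("weakest in r1:", weakest)
for stage, share in shares.items():
    print(f"{stage}: {share:.1%}")
```

For this entry the round 1 mean is 8.0, depth and specificity are the weakest dimensions, and synthesis accounts for roughly 48% of the 17,590 tokens.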
Analyzing Results
The analytics.py script reads runs.jsonl and computes summary statistics per task type and per experiment:
```bash
python analytics.py          # cross-run stats
python analytics.py --full   # per-run detail
```
For more targeted analysis, filter runs.jsonl by session or timestamp:
```python
import json
from statistics import mean

# Load one JSON object per non-empty line
with open('runs.jsonl') as f:
    runs = [json.loads(line) for line in f if line.strip()]

# Experiment 2: compare before/after harness upgrade
before = [r['first_wiggum_score'] for r in runs if r['session_id'] == 'exp02_baseline']
after = [r['first_wiggum_score'] for r in runs if r['session_id'] == 'exp02_upgraded']

# mean() raises StatisticsError on an empty list, so a mistyped session_id fails loudly
mean_before, mean_after = mean(before), mean(after)
print(f'Before: {mean_before:.2f} After: {mean_after:.2f} Delta: {mean_after - mean_before:+.2f}')
```
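Filtering by timestamp follows the same pattern. The sketch below assumes timestamps use the ISO 8601 format shown in the sample entry (for example 2026-04-10T14:32:01); the helper name runs_between and the records are hypothetical.

```python
from datetime import datetime


def runs_between(runs, start, end):
    """Return runs whose timestamp falls in the half-open window [start, end)."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return [r for r in runs if lo <= datetime.fromisoformat(r["timestamp"]) < hi]


# Hypothetical records using the timestamp format from the sample entry.
runs = [
    {"run_id": "a", "timestamp": "2026-04-10T14:32:01"},
    {"run_id": "b", "timestamp": "2026-04-11T09:05:40"},
]
same_day = runs_between(runs, "2026-04-10T00:00:00", "2026-04-11T00:00:00")
print([r["run_id"] for r in same_day])  # ['a']
```

A half-open window avoids counting a run twice when consecutive windows share a boundary.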
The Keep Rule
For autoresearch experiments (Module 5), a keep rule governs whether a harness change is committed:
Keep the change if composite_score_delta > 0.1. Otherwise, revert with git reset HEAD~1 --soft.
This is a strict threshold — a 0.1-point improvement on a 0–10 scale is small but meaningful, because composite scores are averages over multiple tasks and replications. Anything smaller is within noise.
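The decision itself is simple enough to sketch. This is an illustration under an assumption: composite_score_delta is taken to be the mean composite score after the change minus the mean before it, which the source implies but does not define. The function names and scores are hypothetical, and the sketch only makes the decision; it does not run git.

```python
from statistics import mean

KEEP_THRESHOLD = 0.1  # from the keep rule: composite_score_delta > 0.1


def composite_score_delta(before_scores, after_scores):
    """Mean composite score after the change minus the mean before it (assumed definition)."""
    return mean(after_scores) - mean(before_scores)


def should_keep(before_scores, after_scores, threshold=KEEP_THRESHOLD):
    """Apply the strict comparison: the delta must exceed the threshold."""
    return composite_score_delta(before_scores, after_scores) > threshold


# Hypothetical composite scores, three replications per configuration.
baseline = [7.8, 8.0, 8.2]
upgraded = [8.4, 8.6, 8.2]
print(should_keep(baseline, upgraded))  # True: delta is about +0.4
print(should_keep(baseline, baseline))  # False: delta is 0
```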
For manual harness experiments, apply the same discipline: if the difference between the two mean scores is smaller than one standard deviation of the per-run scores, do not treat it as a reliable improvement.
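One way to apply this check is sketched below. It rests on an assumed reading of the rule: the noise estimate is the sample standard deviation of the baseline runs, and the improvement must exceed it. The function name exceeds_noise and the scores are hypothetical.

```python
from statistics import mean, stdev


def exceeds_noise(before_scores, after_scores):
    """True if the mean improvement is larger than one sample standard
    deviation of the baseline scores (an assumed reading of the rule)."""
    delta = mean(after_scores) - mean(before_scores)
    noise = stdev(before_scores)  # sample standard deviation; needs at least 2 values
    return delta > noise


baseline = [7.0, 8.0, 9.0]     # hypothetical first_wiggum_score values, stdev = 1.0
small_gain = [7.5, 8.5, 9.5]   # mean rises by 0.5
large_gain = [9.5, 9.5, 9.5]   # mean rises by 1.5
print(exceeds_noise(baseline, small_gain))  # False
print(exceeds_noise(baseline, large_gain))  # True
```

With only three replications the standard deviation is itself a rough estimate, so treat this as a screening check rather than a significance test.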