Harness Engineering for AI Agents · Self-Improvement

The Data Pipeline

By the end of this reading you will be able to:
  • Identify which fields in a runs.jsonl entry map to SFT training examples, DPO preference pairs, and reward model data
  • Explain the two sources of DPO preference pairs (cross-run pairs and wiggum-revision pairs) and why each represents a valid preference signal
  • Trace how hf_export.py transforms runs.jsonl into HuggingFace-ready datasets and describe the quality filter applied to each dataset type

runs.jsonl as Training Data Source

Every harness run is a labeled example. The task string is the input. The final output (after Wiggum verification) is the generated text. The Wiggum score is the quality label. The revision history is a preference signal.

The data pipeline converts runs.jsonl into four HuggingFace-ready dataset formats, each suited to a different training objective.
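
As a rough illustration of the mapping described above, the sketch below parses JSONL text into run records and reads out the input, output, and quality label. The field names `task`, `final_output`, and `final_score` are assumptions borrowed from the snippets later in this reading, and `load_runs` is a hypothetical helper; the real runs.jsonl schema may carry more or differently named fields.

```python
import json

def load_runs(lines):
    """Parse an iterable of JSONL lines into run dicts, skipping blank lines."""
    runs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        runs.append(json.loads(line))
    return runs

# Hypothetical two-run log (assumed field names, made-up values).
sample = [
    '{"task": "Summarize X", "final_output": "X is ...", "final_score": 8.6}',
    '',
    '{"task": "Summarize Y", "final_output": "Y is ...", "final_score": 5.1}',
]

runs = load_runs(sample)
for run in runs:
    # task = input, final_output = generated text, final_score = quality label
    print(run["task"], run["final_score"])
```

Because an open file is also an iterable of lines, the same helper could be called as `load_runs(open("runs.jsonl"))` on a real log.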

Dataset Types

SFT (Supervised Fine-Tuning)

High-quality runs as instruction-following examples:

# Format: {prompt, completion} pairs
{
  "prompt": "<system>...synthesis instruction...</system>\n<user>" + task + "</user>",
  "completion": final_output
}

Quality filter: final_score >= sft_min_score (default 8.0). Only PASS runs with strong Wiggum scores are used, because SFT teaches the model to imitate its training examples and mediocre examples pull output quality down.

Typical SFT dataset size: ~300–500 examples after applying the quality filter to 1,500 runs.
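
A minimal sketch of the SFT filter and formatting step, assuming each run carries `task`, `final_output`, and `final_score` fields. The function name `build_sft_examples` and the `system_prompt` parameter are hypothetical; the real synthesis instruction is elided in the format above, so the example passes a placeholder. The PASS check is left out because the field that records the verdict is not shown in this reading.

```python
def build_sft_examples(runs, system_prompt, sft_min_score=8.0):
    """Keep runs at or above the score threshold and format them as
    {prompt, completion} records."""
    examples = []
    for run in runs:
        if run["final_score"] < sft_min_score:
            continue  # below the quality bar, not used for SFT
        prompt = f"<system>{system_prompt}</system>\n<user>{run['task']}</user>"
        examples.append({"prompt": prompt, "completion": run["final_output"]})
    return examples

# Made-up runs: one strong, one weak, one exactly at the threshold.
runs = [
    {"task": "A", "final_output": "good answer", "final_score": 8.4},
    {"task": "B", "final_output": "weak answer", "final_score": 6.0},
    {"task": "C", "final_output": "borderline", "final_score": 8.0},
]

sft = build_sft_examples(runs, system_prompt="PLACEHOLDER")
print(len(sft))  # runs A and C pass; B is filtered out
```

The comparison is inclusive, matching the `final_score >= sft_min_score` rule stated above.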

DPO (Direct Preference Optimization)

Preference pairs where the model sees a task and must prefer one completion over another:

# Format: {prompt, chosen, rejected}
{
  "prompt": task,
  "chosen": high_score_output,    # the preferred completion
  "rejected": low_score_output     # the rejected completion
}

Two sources of preference pairs:

Cross-run pairs: Two runs on the same task (or semantically similar tasks) with different Wiggum scores. The higher-scoring run is chosen; the lower is rejected. Pairs are only kept when score_delta >= min_delta (default 0.5).

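Wiggum-revision pairs: For runs with multiple Wiggum rounds, both the round 1 output and the round 2 output are available. Round 2 is the revision written in response to the evaluator's feedback, so it usually scores higher. Using (round 1 output, round 2 output) as (rejected, chosen) creates preference pairs from a single run. As with cross-run pairs, a pair is kept only when the round 2 score exceeds the round 1 score by at least min_delta.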

# build_dpo_dataset.py
def build_revision_pairs(runs, min_delta=0.5):
    pairs = []
    for run in runs:
        if run['wiggum_rounds'] < 2:
            continue  # no revision available
        r1_score = run['wiggum_r1_score']
        r2_score = run['wiggum_scores']['r2']['weighted']
        delta = r2_score - r1_score
        if delta < min_delta:
            continue  # preference signal too weak
        pairs.append({
            "prompt": run['task'],
            "chosen": run['output_r2'],   # revised, higher quality
            "rejected": run['output_r1']  # original, lower quality
        })
    return pairs
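
The cross-run source described earlier could be sketched in the same style. This is an illustration under stated assumptions, not the contents of build_dpo_dataset.py: `build_cross_run_pairs` is a hypothetical name, runs are grouped by exact task string only (semantic matching of similar tasks is omitted), and the `final_output` and `final_score` field names are assumed.

```python
from collections import defaultdict
from itertools import combinations

def build_cross_run_pairs(runs, min_delta=0.5):
    """Pair runs that share a task and keep pairs whose score gap
    is at least min_delta."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task"]].append(run)

    pairs = []
    for task, group in by_task.items():
        for a, b in combinations(group, 2):
            # Order each pair so `hi` is the higher-scoring run.
            hi, lo = (a, b) if a["final_score"] >= b["final_score"] else (b, a)
            if hi["final_score"] - lo["final_score"] < min_delta:
                continue  # preference signal too weak
            pairs.append({
                "prompt": task,
                "chosen": hi["final_output"],
                "rejected": lo["final_output"],
            })
    return pairs

# Made-up runs: three attempts at T1, one at T2.
runs = [
    {"task": "T1", "final_output": "a", "final_score": 9.0},
    {"task": "T1", "final_output": "b", "final_score": 7.0},
    {"task": "T1", "final_output": "c", "final_score": 8.75},
    {"task": "T2", "final_output": "d", "final_score": 6.0},
]

pairs = build_cross_run_pairs(runs)
print(len(pairs))
```

Pairing every combination within a task is one possible design; a stricter variant might pair only the best and worst run per task to avoid reusing the same output in many pairs.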

Reward Model

Inputs paired with scalar scores for training a reward model:

{
  "prompt": task,
  "completion": output,
  "score": final_score  # 0.0 – 10.0
}

All runs are included (not filtered by score) — the reward model needs to learn the full quality distribution, including bad examples.
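
For contrast with the SFT filter, here is a small sketch of the reward export, which keeps every run. The helper name `build_reward_records` and the field names are assumptions; the only check applied is that the score falls in the documented 0.0 to 10.0 range.

```python
def build_reward_records(runs):
    """Convert every run into a {prompt, completion, score} record.
    No score filter is applied, so low-scoring runs are kept."""
    records = []
    for run in runs:
        score = float(run["final_score"])
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"score out of range: {score}")
        records.append({
            "prompt": run["task"],
            "completion": run["final_output"],
            "score": score,
        })
    return records

# Made-up runs spanning good and bad scores.
runs = [
    {"task": "A", "final_output": "strong", "final_score": 9.1},
    {"task": "B", "final_output": "poor", "final_score": 2.5},
]

reward = build_reward_records(runs)
print([r["score"] for r in reward])
```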

Trajectory

Multi-turn records showing the evaluation → revision history:

{
  "task": task,
  "turns": [
    {"role": "assistant", "content": output_r1},
    {"role": "user",      "content": issues_from_wiggum_r1},
    {"role": "assistant", "content": output_r2},
    {"role": "user",      "content": issues_from_wiggum_r2},
    # ...
  ],
  "final_score": final_score
}

Trajectory data is useful for training models that understand revision as a multi-turn process — not just produce-and-submit, but produce-receive-feedback-revise.
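
One way to assemble such a record is to interleave each round's output with the evaluator feedback on it. The sketch below assumes the per-round outputs and issue lists have already been pulled out of the run; `build_trajectory` and its parameters are hypothetical names, not part of the harness.

```python
def build_trajectory(task, outputs, issues, final_score):
    """Interleave drafts with evaluator feedback.

    outputs[i] is the draft from round i+1; issues[i] is the feedback
    on that draft. If the last draft has no feedback, the record ends
    on an assistant turn.
    """
    turns = []
    for i, output in enumerate(outputs):
        turns.append({"role": "assistant", "content": output})
        if i < len(issues):
            turns.append({"role": "user", "content": issues[i]})
    return {"task": task, "turns": turns, "final_score": final_score}

# Made-up two-round run with feedback on the first draft only.
record = build_trajectory(
    task="T",
    outputs=["draft 1", "draft 2"],
    issues=["missing citations"],
    final_score=8.2,
)
print([turn["role"] for turn in record["turns"]])
```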

hf_export.py

# Export to hf_datasets/ directory:
python hf_export.py
python hf_export.py --sft-min-score 8.5   # stricter quality filter

# Push to HuggingFace Hub:
python hf_export.py --push nickmccarty/ollama-pi-harness-datasets

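The script reads runs.jsonl, applies the format conversions above, and writes the first four files below. The fifth, dpo.jsonl, is produced by the separate build_dpo_dataset.py script: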

hf_datasets/
  sft.jsonl         # instruction-following examples (high score only)
  preference.jsonl  # {prompt, chosen, rejected}
  reward.jsonl      # {prompt, completion, score}
  trajectory.jsonl  # multi-turn revision histories
  dpo.jsonl         # built by build_dpo_dataset.py (separate script)
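
The write step itself is simple: one JSON object per line. The sketch below shows that step in isolation using only the standard library. It is not the actual hf_export.py; `write_jsonl` is a hypothetical helper, and the example writes to a temporary directory rather than the real hf_datasets/ folder.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, records):
    """Write records as one JSON object per line; return the count written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

# Made-up reward records written to a throwaway directory.
out_dir = Path(tempfile.mkdtemp()) / "hf_datasets"
n_written = write_jsonl(out_dir / "reward.jsonl", [
    {"prompt": "A", "completion": "strong", "score": 9.1},
    {"prompt": "B", "completion": "poor", "score": 2.5},
])

lines = (out_dir / "reward.jsonl").read_text(encoding="utf-8").splitlines()
print(n_written, len(lines))
```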

DPO Dataset Statistics

From 1,500 runs with default settings (--min-delta 0.5):

python build_dpo_dataset.py --stats
# Cross-run pairs:      142 (delta >= 0.5)
# Revision pairs:        89 (wiggum_rounds >= 2, delta >= 0.5)
# Total DPO pairs:      231

Raising --min-delta to 1.0 yields fewer pairs, but each surviving pair reflects a larger score gap and is less likely to be scoring noise. For fine-tuning, a smaller set of clear preferences is generally preferred over a larger noisy one.
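
To see the trade-off on your own data before committing to a threshold, count how many candidate pairs survive at each setting. The sketch below uses made-up score deltas, and `count_pairs_by_threshold` is a hypothetical helper, not a flag or function of build_dpo_dataset.py.

```python
def count_pairs_by_threshold(deltas, thresholds):
    """Return {threshold: number of pairs with delta >= threshold}."""
    return {t: sum(1 for d in deltas if d >= t) for t in thresholds}

# Made-up score gaps for six candidate pairs.
deltas = [0.25, 0.5, 0.75, 1.0, 1.5, 2.25]

counts = count_pairs_by_threshold(deltas, [0.5, 1.0])
print(counts)  # {0.5: 5, 1.0: 3}
```

In this toy example, doubling the threshold drops two of the five pairs, and the dropped pairs are the ones with the smallest gaps.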

Why This Matters

The data pipeline closes the self-improvement loop:

runs.jsonl → SFT/DPO datasets → fine-tuned model → improved producer
     ↑                                                      ↓
     └──────────── better runs → better data ←─────────────┘

Each iteration of the harness produces training data that makes the next iteration's model better, which produces better runs, which produces better training data. The compound effect requires many cycles — but the data generation is free (it's the work the harness is already doing) and the fine-tuning is local (no API costs).