The Data Pipeline
- Identify which fields in a runs.jsonl entry map to SFT training examples, DPO preference pairs, and reward model data
- Explain the two sources of DPO preference pairs (cross-run pairs and wiggum-revision pairs) and why each represents a valid preference signal
- Trace how hf_export.py transforms runs.jsonl into HuggingFace-ready datasets and describe the quality filter applied to each dataset type
runs.jsonl as Training Data Source
Every harness run is a labeled example. The task string is the input. The final output (after Wiggum verification) is the generated text. The Wiggum score is the quality label. The revision history is a preference signal.
The data pipeline converts runs.jsonl into four HuggingFace-ready dataset formats, each suited to a different training objective.
Dataset Types
SFT (Supervised Fine-Tuning)
High-quality runs as instruction-following examples:
# Format: {prompt, completion} pairs
{
    "prompt": "<system>...synthesis instruction...</system>\n<user>" + task + "</user>",
    "completion": final_output
}
Quality filter: final_score >= sft_min_score (default 8.0). Only PASS runs with strong Wiggum scores are used, because fine-tuning on mediocre examples tends to degrade the model.
Typical SFT dataset size: ~300–500 examples after applying the quality filter to 1,500 runs.
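The filter and format above can be sketched as follows. This is a minimal illustration, not the code in hf_export.py: the record keys `final_score`, `task`, and `final_output` are assumed names, and the system prompt is a placeholder because the real synthesis instruction is not shown here. The sketch filters on score only; if PASS/FAIL status is stored in a separate field, that check would need to be added.

```python
from typing import Iterable

# Placeholder: the actual synthesis instruction is elided in this lesson.
SYSTEM_PROMPT = "...synthesis instruction..."


def build_sft_examples(runs: Iterable[dict], sft_min_score: float = 8.0) -> list[dict]:
    """Keep only high-scoring runs and format them as {prompt, completion}."""
    examples = []
    for run in runs:
        # 'final_score', 'task', and 'final_output' are assumed field names.
        if run.get("final_score", 0.0) < sft_min_score:
            continue  # below the quality bar, skip
        prompt = f"<system>{SYSTEM_PROMPT}</system>\n<user>{run['task']}</user>"
        examples.append({"prompt": prompt, "completion": run["final_output"]})
    return examples
```

Raising `sft_min_score` to 8.5 in this sketch corresponds to the stricter `--sft-min-score 8.5` setting: fewer examples, each with a higher score.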
DPO (Direct Preference Optimization)
Preference pairs where the model sees a task and must prefer one completion over another:
# Format: {prompt, chosen, rejected}
{
    "prompt": task,
    "chosen": high_score_output,    # the preferred completion
    "rejected": low_score_output    # the rejected completion
}
Two sources of preference pairs:
Cross-run pairs: Two runs on the same task (or semantically similar tasks) with different Wiggum scores. The higher-scoring run is chosen; the lower is rejected. Pairs are kept only when score_delta >= min_delta (default 0.5).
Wiggum-revision pairs: A run with multiple Wiggum rounds records both its round 1 and round 2 outputs. Round 2 is the revision written in response to Wiggum's feedback, so it is usually the better of the two; the pair is kept only when its score exceeds round 1's by at least min_delta. Using round 1 as rejected and round 2 as chosen yields a preference pair from a single run.
# build_dpo_dataset.py
def build_revision_pairs(runs, min_delta=0.5):
    """Build (chosen, rejected) pairs from round 1 and round 2 of the same run."""
    pairs = []
    for run in runs:
        if run['wiggum_rounds'] < 2:
            continue  # no revision available
        r1_score = run['wiggum_r1_score']
        r2_score = run['wiggum_scores']['r2']['weighted']
        delta = r2_score - r1_score
        if delta < min_delta:
            continue  # preference signal too weak
        pairs.append({
            "prompt": run['task'],
            "chosen": run['output_r2'],    # revised, higher quality
            "rejected": run['output_r1']   # original, lower quality
        })
    return pairs
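The cross-run pairs described above could be built along the same lines. The sketch below is an assumption-laden illustration rather than the contents of build_dpo_dataset.py: it groups runs by exact task string only (matching semantically similar tasks would need an extra step), and it assumes each run carries `final_score` and `final_output` keys.

```python
from collections import defaultdict
from itertools import combinations


def build_cross_run_pairs(runs, min_delta=0.5):
    """Pair runs that share an identical task string and differ enough in score."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task"]].append(run)

    pairs = []
    for task, group in by_task.items():
        for a, b in combinations(group, 2):
            # Order the pair so 'hi' is the higher-scoring run.
            hi, lo = (a, b) if a["final_score"] >= b["final_score"] else (b, a)
            if hi["final_score"] - lo["final_score"] < min_delta:
                continue  # preference signal too weak
            pairs.append({
                "prompt": task,
                "chosen": hi["final_output"],
                "rejected": lo["final_output"],
            })
    return pairs
```

Because every qualifying combination within a task is kept, a task with many runs can contribute many pairs; capping pairs per task is one way to keep a few heavily repeated tasks from dominating the dataset.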
Reward Model
Inputs paired with scalar scores for training a reward model:
{
    "prompt": task,
    "completion": output,
    "score": final_score    # 0.0 – 10.0
}
All runs are included (not filtered by score) — the reward model needs to learn the full quality distribution, including bad examples.
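A minimal sketch of the reward export, assuming the same hypothetical record keys (`task`, `final_output`, `final_score`). Nothing is filtered out; the only check is that each score lies on the 0.0 to 10.0 scale given above, and raising an error on a violation is a design choice of this sketch, not documented behavior.

```python
def build_reward_records(runs):
    """Convert every run into a {prompt, completion, score} record, unfiltered."""
    records = []
    for run in runs:
        score = float(run["final_score"])
        if not 0.0 <= score <= 10.0:
            # A score outside the scale points to a logging bug, so fail loudly.
            raise ValueError(f"score out of range: {score}")
        records.append({
            "prompt": run["task"],
            "completion": run["final_output"],
            "score": score,
        })
    return records
```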
Trajectory
Multi-turn records showing the evaluation → revision history:
{
    "task": task,
    "turns": [
        {"role": "assistant", "content": output_r1},
        {"role": "user", "content": issues_from_wiggum_r1},
        {"role": "assistant", "content": output_r2},
        {"role": "user", "content": issues_from_wiggum_r2},
        # ...
    ],
    "final_score": final_score
}
Trajectory data is useful for training models that understand revision as a multi-turn process — not just produce-and-submit, but produce-receive-feedback-revise.
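One way to assemble such a record is sketched below. The per-round layout is hypothetical: it assumes each run stores a `rounds` list whose items hold an `output` string and an optional `issues` string, which may not match how runs.jsonl actually records revision history.

```python
def build_trajectory(run):
    """Interleave each round's output with the Wiggum feedback that followed it."""
    turns = []
    for rnd in run["rounds"]:  # 'rounds', 'output', 'issues' are assumed keys
        turns.append({"role": "assistant", "content": rnd["output"]})
        issues = rnd.get("issues")
        if issues:
            # Feedback is modeled as a user turn, as in the format above.
            turns.append({"role": "user", "content": issues})
    return {
        "task": run["task"],
        "turns": turns,
        "final_score": run["final_score"],
    }
```

In this sketch a round without recorded issues (for example, the final passing round) contributes only its assistant turn.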
hf_export.py
# Export to hf_datasets/ directory:
python hf_export.py
python hf_export.py --sft-min-score 8.5 # stricter quality filter
# Push to HuggingFace Hub:
python hf_export.py --push nickmccarty/ollama-pi-harness-datasets
The script reads runs.jsonl, applies the format conversions above, and writes:
hf_datasets/
    sft.jsonl           # instruction-following examples (high score only)
    preference.jsonl    # {prompt, chosen, rejected}
    reward.jsonl        # {prompt, completion, score}
    trajectory.jsonl    # multi-turn revision histories
    dpo.jsonl           # built by build_dpo_dataset.py (separate script)
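The read and write steps amount to plain JSON Lines I/O, one JSON object per line. The helpers below are a standard-library sketch; the names `read_jsonl` and `write_jsonl` are illustrative and are not taken from hf_export.py.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(records, path):
    """Write records as JSON Lines, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A typical use would be `runs = read_jsonl("runs.jsonl")`, followed by one `write_jsonl(...)` call per dataset file under hf_datasets/.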
DPO Dataset Statistics
From 1,500 runs with default settings (--min-delta 0.5):
python build_dpo_dataset.py --stats
# Cross-run pairs: 142 (delta >= 0.5)
# Revision pairs: 89 (wiggum_rounds >= 2, delta >= 0.5)
# Total DPO pairs: 231
Increasing --min-delta to 1.0 shrinks the dataset but strengthens each preference signal: fewer pairs, less noise. For fine-tuning, a smaller set of clear preferences is generally better than a larger noisy one.
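The trade-off can be checked before committing to a threshold by counting how many candidate pairs survive at each setting. The helper below is a hypothetical sketch, and the score deltas in the usage note are made-up toy values, not measurements from the 1,500 runs.

```python
def count_pairs_by_threshold(deltas, thresholds=(0.5, 1.0, 1.5)):
    """Return {threshold: number of pairs whose score delta meets it}."""
    return {t: sum(1 for d in deltas if d >= t) for t in thresholds}
```

For the toy deltas `[0.5, 0.7, 1.0, 2.0]`, the function returns `{0.5: 4, 1.0: 2, 1.5: 1}`: each step up in the threshold discards the pairs with the weakest signal first.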
Why This Matters
The data pipeline closes the self-improvement loop:
runs.jsonl → SFT/DPO datasets → fine-tuned model → improved producer
    ↑                                                     ↓
    └──────────── better runs → better data ←─────────────┘
Each iteration of the harness produces training data that makes the next iteration's model better, which produces better runs, which produces better training data. The compound effect requires many cycles, but the data generation adds no extra work (the harness is already producing these runs) and the fine-tuning is local (no API costs).