Observability
- Instrument a new pipeline stage with Chrome Trace Event spans using RunTrace, and verify the output in Perfetto UI
- Interpret a Perfetto waterfall trace to identify the bottleneck stage and the latency of the panel evaluation threads
- Describe the runs.jsonl schema at the field level, explain how the leverage metric is computed, and filter runs by session, task type, or score range for targeted analysis
The Observability Stack
The harness uses three complementary observability mechanisms:
- Chrome Trace Events in
traces/— per-stage wall-clock latencies, visualized in Perfetto UI - runs.jsonl — structured per-run records for quantitative analysis
- inspect_run.py — quick human-readable summary of recent runs
These are not redundant. Traces answer when did each stage run and how long did it take? Run logs answer what quality did each run achieve and what did it cost? inspect_run.py answers what happened in the last run?
Chrome Trace Events
The Chrome Trace Event format is a JSON-based profiling format supported by Perfetto UI, the Chrome DevTools performance tab, and Speedscope. Every harness run writes a trace file:
traces/20260410_143201_context-engineering.json
Load it at ui.perfetto.dev by dragging the file onto the page. The waterfall view shows each stage as a horizontal bar spanning its wall-clock duration.
RunTrace
logger.py provides a context-manager-based RunTrace class:
from logger import RunTrace
with RunTrace(task=task, run_id=run_id) as trace:
# Stage spans:
with trace.span("planning"):
plan = make_plan(task, memory_context)
with trace.span("research"):
research = gather_research(task, plan)
with trace.span("synthesis"):
output = synthesize(task, research, plan)
# Log LLM usage at any point:
trace.log_usage(response, stage="synthesis")
# Name threads for panel evaluation:
trace.name_thread("panel/Domain-Practitioner")
The trace.span() context manager emits a begin event on entry and an end event on exit, with the wall-clock timestamp and duration. The resulting JSON trace uses the ph: "X" (complete event) format:
{
"traceEvents": [
{"ph": "X", "name": "planning", "ts": 1744291321000000, "dur": 4200000, "pid": 1, "tid": 1},
{"ph": "X", "name": "research", "ts": 1744291325200000, "dur": 42000000, "pid": 1, "tid": 1},
{"ph": "X", "name": "synthesis", "ts": 1744291367200000, "dur": 95000000, "pid": 1, "tid": 1},
{"ph": "X", "name": "wiggum_r1", "ts": 1744291462200000, "dur": 28000000, "pid": 1, "tid": 1},
{"ph": "X", "name": "panel/Domain-Practitioner", "ts": 1744291490200000, "dur": 35000000, "pid": 1, "tid": 2},
{"ph": "X", "name": "panel/Critical-Reviewer", "ts": 1744291490200000, "dur": 32000000, "pid": 1, "tid": 3}
]
}
The tid (thread ID) field creates separate lanes in the Perfetto waterfall — the panel threads appear in parallel lanes, making the parallelism immediately visible.
Instrumenting a New Stage
Adding observability to a new pipeline stage:
def my_new_stage(task: str, context: str, trace: RunTrace) -> str:
with trace.span("my_new_stage"):
# ... do work ...
result = call_model(prompt)
trace.log_usage(result, stage="my_new_stage") # token accounting
return result["message"]["content"]
The trace is passed as a parameter through the call stack — it is not a global. This makes it explicit which stages are instrumented and testable without a full pipeline run.
runs.jsonl Schema
The full schema of a runs.jsonl record:
@dataclass
class RunRecord:
run_id: str # timestamp + task slug
session_id: str # experiment session identifier
timestamp: str # ISO 8601
task: str # original task string
task_type: str # enumerated | best_practices | research
producer_model: str
evaluator_model: str
active_skills: list # skill names that were active
tokens_by_stage: dict # token counts per stage (see previous reading)
wiggum_scores: dict # {"r1": dims_dict, "r2": dims_dict, ...}
wiggum_rounds: int
final_score: float # weighted score at final round
wiggum_r1_score: float # first-pass score (key metric)
final: str # PASS | FAIL | ERROR
output_path: str
output_bytes: int
output_lines: int
criteria_results: dict # {criterion_id: True/False}
criteria_rate: float # fraction of criteria passing
novelty_scores: list # [int] per search round
search_rounds: int
The Leverage Metric
Every runs.jsonl record includes a leverage field — a composite reward signal that combines output quality, task complexity, and computational cost:
# In logger.py, computed at run close:
if tac_hours and scores:
quality_norm = scores[-1] / 10.0 # Wiggum final score, normalized
tac_s = tac_hours * 3600.0 # task complexity in seconds
energy_cost = float(os.environ.get("HARNESS_ENERGY_COST_PER_RUN", 0.0))
infer_cost = float(os.environ.get("HARNESS_INFERENCE_COST_PER_RUN", 0.0))
hourly_rate = float(os.environ.get("HARNESS_HOURLY_RATE", 75.0))
cost_s = ((energy_cost + infer_cost) / hourly_rate) * 3600.0
leverage = round(
(tac_s * quality_norm) / max(runtime_s + cost_s, 1.0), 2
)
elif scores and output_lines and runtime_s > 0:
# Proxy formula — derivable from every run without TAC estimate
leverage = round(
scores[-1] * output_lines / max(runtime_s / 3600.0, 1e-6), 2
)
The full formula: leverage = (TAC_seconds × quality_normalized) / (runtime_seconds + cost_seconds)
- TAC (Time-Adjusted Complexity) — estimated hours a human expert would need to complete the same task. A 5-technique research summary might be TAC=1.0h; a full literature review might be TAC=8.0h. Set via environment variable or task metadata.
- quality_normalized — the Wiggum final score divided by 10.
- cost_seconds — inference and energy costs converted to equivalent human-hours at
HARNESS_HOURLY_RATE(default: $75/hr), then to seconds.
The proxy formula (when TAC is not set): score × output_lines / runtime_hours — rougher but always computable. A 10-minute run producing 150 lines at score 8.5 → 8.5 × 150 / (10/60) = 7,650.
Why leverage matters for harness design: leverage is the RL reward signal for autoresearch optimization. A harness configuration that produces 9.0-scoring outputs in 6 minutes on 2-hour tasks outperforms one that produces 8.7-scoring outputs in 8 minutes — even though the absolute score difference is small. Tracking leverage over sessions lets you identify whether synthesis instruction changes are actually improving efficiency or just shifting the quality-latency tradeoff.
{
"leverage": 14.3,
"final_score": 8.85,
"wiggum_rounds": 1,
"runtime_s": 312
}
A leverage of 14.3 means the harness produced roughly 14× the value in the time it took to run, normalized to the reference hourly rate. Sessions where mean leverage rises indicate genuine improvement; sessions where it falls despite rising scores indicate a latency regression.
inspect_run.py
For quick human-readable review after a run:
python inspect_run.py # last run, full detail
python inspect_run.py 3 # last 3 runs, summary
python inspect_run.py --all # all runs, table summary
Typical output for a single run:
Run: 20260410_143201_context-engineering
Task: Search for the top 5 context engineering techniques...
Model: pi-qwen-32b → Qwen3-Coder:30b (evaluator)
Result: PASS (2 rounds)
Wiggum R1: 7.65 → R2: 8.45 (PASS)
Relevance: 9.0 → 9.0
Completeness: 8.0 → 9.0
Depth: 7.0 → 8.0 ← improved
Specificity: 6.0 → 7.0 ← improved
Structure: 9.0 → 9.0
Tokens: 18,900 total
Synthesis: 8,450 (45%)
Research: 3,820 (20%)
Wiggum: 4,080 (22%)
Other: 2,550 (13%)
Output: ~/Desktop/context-engineering.md (4,821 bytes, 142 lines)
Cross-Run Analysis
For systematic analysis across multiple runs, filter runs.jsonl directly:
import json
runs = [json.loads(l) for l in open('runs.jsonl')]
# Depth score trend over time:
depth_over_time = [
(r['timestamp'], r['wiggum_scores'].get('r1', {}).get('depth', 0))
for r in runs
]
# Runs that needed revision (rounds > 1) by task type:
need_revision = {
task_type: sum(1 for r in runs
if r['task_type'] == task_type and r['wiggum_rounds'] > 1)
for task_type in ['enumerated', 'best_practices', 'research']
}
Or use analytics.py for the standard cross-run report:
python analytics.py # mean scores by task type + model
python analytics.py --full # per-run table