Harness Engineering for AI Agents · Production Systems

Observability

12 min read
By the end of this reading you will be able to:
  • Instrument a new pipeline stage with Chrome Trace Event spans using RunTrace, and verify the output in Perfetto UI
  • Interpret a Perfetto waterfall trace to identify the bottleneck stage and the latency of the panel evaluation threads
  • Describe the runs.jsonl schema at the field level, explain how the leverage metric is computed, and filter runs by session, task type, or score range for targeted analysis

The Observability Stack

The harness uses three complementary observability mechanisms:

  1. Chrome Trace Events in traces/ — per-stage wall-clock latencies, visualized in Perfetto UI
  2. runs.jsonl — structured per-run records for quantitative analysis
  3. inspect_run.py — quick human-readable summary of recent runs

These are not redundant. Traces answer when did each stage run and how long did it take? Run logs answer what quality did each run achieve and what did it cost? inspect_run.py answers what happened in the last run?

Chrome Trace Events

The Chrome Trace Event format is a JSON-based profiling format supported by Perfetto UI, the Chrome DevTools performance tab, and Speedscope. Every harness run writes a trace file:

traces/20260410_143201_context-engineering.json

Load it at ui.perfetto.dev by dragging the file onto the page. The waterfall view shows each stage as a horizontal bar spanning its wall-clock duration.

RunTrace

logger.py provides a context-manager-based RunTrace class:

from logger import RunTrace

with RunTrace(task=task, run_id=run_id) as trace:
    # Stage spans:
    with trace.span("planning"):
        plan = make_plan(task, memory_context)

    with trace.span("research"):
        research = gather_research(task, plan)

    with trace.span("synthesis"):
        output = synthesize(task, research, plan)

    # Log LLM usage at any point:
    trace.log_usage(response, stage="synthesis")

    # Name threads for panel evaluation:
    trace.name_thread("panel/Domain-Practitioner")

The trace.span() context manager emits a begin event on entry and an end event on exit, with the wall-clock timestamp and duration. The resulting JSON trace uses the ph: "X" (complete event) format:

{
  "traceEvents": [
    {"ph": "X", "name": "planning",   "ts": 1744291321000000, "dur": 4200000, "pid": 1, "tid": 1},
    {"ph": "X", "name": "research",   "ts": 1744291325200000, "dur": 42000000, "pid": 1, "tid": 1},
    {"ph": "X", "name": "synthesis",  "ts": 1744291367200000, "dur": 95000000, "pid": 1, "tid": 1},
    {"ph": "X", "name": "wiggum_r1",  "ts": 1744291462200000, "dur": 28000000, "pid": 1, "tid": 1},
    {"ph": "X", "name": "panel/Domain-Practitioner", "ts": 1744291490200000, "dur": 35000000, "pid": 1, "tid": 2},
    {"ph": "X", "name": "panel/Critical-Reviewer",   "ts": 1744291490200000, "dur": 32000000, "pid": 1, "tid": 3}
  ]
}

The tid (thread ID) field creates separate lanes in the Perfetto waterfall — the panel threads appear in parallel lanes, making the parallelism immediately visible.

Instrumenting a New Stage

Adding observability to a new pipeline stage:

def my_new_stage(task: str, context: str, trace: RunTrace) -> str:
    with trace.span("my_new_stage"):
        # ... do work ...
        result = call_model(prompt)
        trace.log_usage(result, stage="my_new_stage")  # token accounting
        return result["message"]["content"]

The trace is passed as a parameter through the call stack — it is not a global. This makes it explicit which stages are instrumented and testable without a full pipeline run.

runs.jsonl Schema

The full schema of a runs.jsonl record:

@dataclass
class RunRecord:
    run_id: str            # timestamp + task slug
    session_id: str        # experiment session identifier
    timestamp: str         # ISO 8601
    task: str              # original task string
    task_type: str         # enumerated | best_practices | research
    producer_model: str
    evaluator_model: str
    active_skills: list    # skill names that were active
    tokens_by_stage: dict  # token counts per stage (see previous reading)
    wiggum_scores: dict    # {"r1": dims_dict, "r2": dims_dict, ...}
    wiggum_rounds: int
    final_score: float     # weighted score at final round
    wiggum_r1_score: float # first-pass score (key metric)
    final: str             # PASS | FAIL | ERROR
    output_path: str
    output_bytes: int
    output_lines: int
    criteria_results: dict # {criterion_id: True/False}
    criteria_rate: float   # fraction of criteria passing
    novelty_scores: list   # [int] per search round
    search_rounds: int

The Leverage Metric

Every runs.jsonl record includes a leverage field — a composite reward signal that combines output quality, task complexity, and computational cost:

# In logger.py, computed at run close:
if tac_hours and scores:
    quality_norm = scores[-1] / 10.0           # Wiggum final score, normalized
    tac_s        = tac_hours * 3600.0          # task complexity in seconds
    energy_cost  = float(os.environ.get("HARNESS_ENERGY_COST_PER_RUN", 0.0))
    infer_cost   = float(os.environ.get("HARNESS_INFERENCE_COST_PER_RUN", 0.0))
    hourly_rate  = float(os.environ.get("HARNESS_HOURLY_RATE", 75.0))
    cost_s       = ((energy_cost + infer_cost) / hourly_rate) * 3600.0

    leverage = round(
        (tac_s * quality_norm) / max(runtime_s + cost_s, 1.0), 2
    )
elif scores and output_lines and runtime_s > 0:
    # Proxy formula — derivable from every run without TAC estimate
    leverage = round(
        scores[-1] * output_lines / max(runtime_s / 3600.0, 1e-6), 2
    )

The full formula: leverage = (TAC_seconds × quality_normalized) / (runtime_seconds + cost_seconds)

  • TAC (Time-Adjusted Complexity) — estimated hours a human expert would need to complete the same task. A 5-technique research summary might be TAC=1.0h; a full literature review might be TAC=8.0h. Set via environment variable or task metadata.
  • quality_normalized — the Wiggum final score divided by 10.
  • cost_seconds — inference and energy costs converted to equivalent human-hours at HARNESS_HOURLY_RATE (default: $75/hr), then to seconds.

The proxy formula (when TAC is not set): score × output_lines / runtime_hours — rougher but always computable. A 10-minute run producing 150 lines at score 8.5 → 8.5 × 150 / (10/60) = 7,650.

Why leverage matters for harness design: leverage is the RL reward signal for autoresearch optimization. A harness configuration that produces 9.0-scoring outputs in 6 minutes on 2-hour tasks outperforms one that produces 8.7-scoring outputs in 8 minutes — even though the absolute score difference is small. Tracking leverage over sessions lets you identify whether synthesis instruction changes are actually improving efficiency or just shifting the quality-latency tradeoff.

{
  "leverage": 14.3,
  "final_score": 8.85,
  "wiggum_rounds": 1,
  "runtime_s": 312
}

A leverage of 14.3 means the harness produced roughly 14× the value in the time it took to run, normalized to the reference hourly rate. Sessions where mean leverage rises indicate genuine improvement; sessions where it falls despite rising scores indicate a latency regression.

inspect_run.py

For quick human-readable review after a run:

python inspect_run.py          # last run, full detail
python inspect_run.py 3        # last 3 runs, summary
python inspect_run.py --all    # all runs, table summary

Typical output for a single run:

Run: 20260410_143201_context-engineering
Task: Search for the top 5 context engineering techniques...
Model: pi-qwen-32b → Qwen3-Coder:30b (evaluator)
Result: PASS (2 rounds)

Wiggum R1: 7.65  →  R2: 8.45 (PASS)
  Relevance:    9.0 → 9.0
  Completeness: 8.0 → 9.0
  Depth:        7.0 → 8.0  ← improved
  Specificity:  6.0 → 7.0  ← improved
  Structure:    9.0 → 9.0

Tokens: 18,900 total
  Synthesis:    8,450 (45%)
  Research:     3,820 (20%)
  Wiggum:       4,080 (22%)
  Other:        2,550 (13%)

Output: ~/Desktop/context-engineering.md (4,821 bytes, 142 lines)

Cross-Run Analysis

For systematic analysis across multiple runs, filter runs.jsonl directly:

import json

runs = [json.loads(l) for l in open('runs.jsonl')]

# Depth score trend over time:
depth_over_time = [
    (r['timestamp'], r['wiggum_scores'].get('r1', {}).get('depth', 0))
    for r in runs
]

# Runs that needed revision (rounds > 1) by task type:
need_revision = {
    task_type: sum(1 for r in runs
                   if r['task_type'] == task_type and r['wiggum_rounds'] > 1)
    for task_type in ['enumerated', 'best_practices', 'research']
}

Or use analytics.py for the standard cross-run report:

python analytics.py         # mean scores by task type + model
python analytics.py --full  # per-run table