Harness Engineering for AI Agents · Verification & Failure Modes

The Wiggum Loop

15 min read

By the end of this reading you will be able to:

Implement the evaluate → revise → verify loop with configurable max rounds and a PASS threshold, logging scores per round
Explain why wiggum_r1 (first-pass score) is privileged over final score in the composite metric and what this incentivizes in harness design
Describe how structured evaluator feedback is extracted from the evaluator's response and injected into the revision prompt
Explain the three loop safeguards — hallucination stub detection, cycling detection, and best-round restoration — and identify the failure mode each prevents

The Name

The loop is named after the Wiggum family from The Simpsons — both of them.

Chief Wiggum is what the loop aspires to be: a legitimate authority with structured procedure, the badge that makes revision mandatory. He follows the rubric, files the report, demands you fix it.

Ralph Wiggum is what it sometimes is. Ralph's defining quality isn't that he's wrong — he's often observing something real. It's that his observation isn't tightly coupled to what actually changed. You revise the essay; Ralph returns the same read. Not because he's cycling mechanically, but because he didn't quite register that something shifted.

The evaluator fails the same way. "Depth is insufficient" in round 2, after the producer added a full implementation section — sincere, locally plausible, not tracking the revision. The loop is named Wiggum because it reaches for Chief and occasionally delivers Ralph. The three safeguards described later in this reading exist precisely because you cannot always tell which Wiggum showed up.

Structure of the Loop

WIGGUM_MAX_ROUNDS = 3
WIGGUM_PASS_THRESHOLD = 9.0  # out of 10

def wiggum_loop(task, output, task_type, producer_model, evaluator_model, trace):
    scores_by_round = []
    _best_score = -1.0
    _best_content = output

    for round_num in range(1, WIGGUM_MAX_ROUNDS + 1):
        # Evaluate
        eval_result = evaluate(output, task, task_type, evaluator_model)
        score = eval_result["weighted_score"]
        issues = eval_result["issues"]  # list of specific problems

        scores_by_round.append({
            "round": round_num,
            "score": score,
            "dims": eval_result["dimensions"]
        })
        log(f"[wiggum r{round_num}] score={score:.1f} issues={len(issues)}")

        # Track best content across rounds
        if score > _best_score:
            _best_score = score
            _best_content = output

        # PASS: score meets threshold
        if score >= WIGGUM_PASS_THRESHOLD:
            log(f"[wiggum] PASS at round {round_num}")
            return _best_content, scores_by_round, "PASS"

        # Final round: FAIL, but return best content seen
        if round_num == WIGGUM_MAX_ROUNDS:
            log(f"[wiggum] FAIL after {WIGGUM_MAX_ROUNDS} rounds")
            return _best_content, scores_by_round, "FAIL"

        # Cycling detection: if all dimension scores are identical to last round, stop
        if round_num > 1:
            prev_dims = scores_by_round[-2]["dims"]
            if all(eval_result["dimensions"][k] == prev_dims[k]
                   for k in eval_result["dimensions"] if k != "issues"):
                log("[wiggum] cycling detected — terminating early")
                return _best_content, scores_by_round, "FAIL"

        # Revise: inject evaluator feedback into revision prompt
        revision_prompt = build_revision_prompt(task, output, issues)
        output = call_producer(revision_prompt, producer_model)

    return _best_content, scores_by_round, "FAIL"

The loop terminates in four ways:

PASS — score ≥ 9.0 at any round
FAIL — max rounds exhausted without reaching threshold
FAIL (cycling) — all dimension scores identical to the previous round
ERROR — evaluator or producer call fails (logged separately)
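
The loop listing above does not show the ERROR path. One way it could be surfaced is a thin wrapper around the loop call. This is a sketch only: `run_wiggum`, its zero-argument `loop_fn` parameter, and the choice to fall back to the unrevised output are all assumptions, not part of the harness as described.

```python
import logging

logger = logging.getLogger("wiggum")


def run_wiggum(loop_fn, fallback_output):
    """Hypothetical wrapper: run the loop and map any exception to ERROR.

    loop_fn is a zero-argument callable (for example a functools.partial
    around wiggum_loop) returning (content, scores_by_round, status).
    """
    try:
        return loop_fn()
    except Exception as exc:  # evaluator or producer call failed
        logger.error("[wiggum] ERROR: %s", exc)
        # Assumption: hand back the unrevised output so the caller has content.
        return fallback_output, [], "ERROR"


def _failing_loop():
    # Stand-in for a loop whose evaluator call times out.
    raise TimeoutError("evaluator did not respond")


content, scores, status = run_wiggum(_failing_loop, "# Draft")
print(status)  # ERROR
```

Keeping the ERROR status separate from FAIL means infrastructure failures do not pollute the quality statistics gathered from scored runs.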

The Evaluation Call

import json

import ollama  # Ollama Python client (pip install ollama)

EVAL_PROMPT = """\
You are evaluating the following output for quality.

Task: {task}
Task type: {task_type}

Output to evaluate:
{output}

Score on a scale of 0-10 for each dimension:
- depth (0-10): Are concrete implementation steps, working examples, and specific configs provided?
- relevance (0-10): Does the output address the task correctly and stay on topic?
- completeness (0-10): Are all required items or aspects covered?
- grounded (0-10): Are claims supported by the research? Does the output avoid asserting facts not present in the source material?
- specificity (0-10): Are named tools, versions, and commands used (not vague generics)?
- structure (0-10): Is the output clearly organized with appropriate headers?

Also list specific issues that prevent a higher score.

Respond as JSON:
{{"depth": N, "relevance": N, "completeness": N, "grounded": N, "specificity": N, "structure": N,
  "issues": ["issue 1", "issue 2", ...]}}"""

def evaluate(output, task, task_type, evaluator_model):
    prompt = EVAL_PROMPT.format(task=task, task_type=task_type, output=output[:6000])
    response = ollama.chat(
        model=evaluator_model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0, "num_predict": 600}
    )
    raw = response["message"]["content"]
    dims = json.loads(extract_json(raw))  # dirtyjson fallback if needed

    weighted = (
        dims["depth"]        * 0.25 +
        dims["relevance"]    * 0.20 +
        dims["completeness"] * 0.20 +
        dims["grounded"]     * 0.15 +
        dims["specificity"]  * 0.10 +
        dims["structure"]    * 0.10
    )
    return {"weighted_score": weighted, "dimensions": dims, "issues": dims["issues"]}

Temperature is set to 0 for the evaluator: scoring should be as repeatable as possible, not creative. The output is truncated at 6,000 characters before being sent to the evaluator; the reading after this one explains what happens to outputs that exceed that limit.
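
The `evaluate` function relies on an `extract_json` helper that is not shown in this reading. A minimal sketch of what such a helper might look like, assuming the evaluator returns exactly one JSON object, possibly surrounded by prose (the real helper, including its dirtyjson fallback, may differ):

```python
import json


def extract_json(raw: str) -> str:
    """Return the substring from the first '{' to the last '}' in raw.

    Hypothetical helper: copes with replies wrapped in extra prose, but not
    with replies that contain several separate JSON objects.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in evaluator response")
    return raw[start:end + 1]


reply = 'Sure. Here is my evaluation: {"depth": 7, "issues": ["no code"]} Hope that helps.'
dims = json.loads(extract_json(reply))
print(dims["depth"])  # 7
```

Raising on a missing object, rather than returning an empty string, lets the caller treat an unparseable evaluation as an ERROR instead of scoring it.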

Building the Revision Prompt

def build_revision_prompt(task, output, issues):
    issue_list = "\n".join(f"- {issue}" for issue in issues)
    return f"""\
Revise the following output to address these specific issues:

{issue_list}

Original task: {task}

Current output:
{output}

Produce the complete revised output. Do not refer to the issues list — just fix them.
Output ONLY valid Markdown starting with a # heading."""

The revision prompt is intentionally simple. The evaluator has already done the work of identifying what is wrong; the producer's job is to fix it. Giving the producer the full issue list (not just a general "improve this") produces targeted revisions rather than wholesale rewrites.
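
The prompt's last line asks for Markdown that starts with a # heading, but a producer can ignore that and open with chat-style preamble. A small guard can check the instruction was followed before the revision replaces the current output. This is a sketch under assumptions: `follows_output_contract` and `accept_revision` are hypothetical names, and keeping the previous output on violation is one possible policy among several.

```python
def follows_output_contract(revised: str) -> bool:
    """True if the text starts with a level-1 Markdown heading."""
    return revised.lstrip().startswith("# ")


def accept_revision(previous: str, revised: str) -> str:
    """Keep the previous output when the revision breaks the contract."""
    return revised if follows_output_contract(revised) else previous


previous = "# Deploy Guide\n\nStep 1: install the CLI."
chatty = "Sure, here is the revised version:\n\n# Deploy Guide"
clean = "# Deploy Guide\n\nStep 1: install the CLI with pip."

print(accept_revision(previous, chatty) == previous)  # True
print(accept_revision(previous, clean) == clean)      # True
```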

Loop Safeguards

Three checks prevent the loop from wasting resources or silently producing degraded output.

Hallucination Stub Detection

Before the weighted score is computed, the harness scans the output for invented API surface: method calls on non-whitelisted objects whose names are twelve or more characters long. Long, highly specific method names are a rough proxy for names the model fabricated rather than recalled from real libraries. The snippet below shows the length check only; the whitelist filter is not shown:

import re

def _count_stub_blocks(output: str) -> int:
    # Matches patterns like: some_object.very_long_fabricated_method_name(
    matches = re.findall(r'\b\w+\.([a-z_]{12,})\s*\(', output)
    return min(len(matches), 2)  # cap penalty at 2 points

# Applied before weighted average:
stub_penalty = _count_stub_blocks(output)
dims["depth"] = max(0, dims["depth"] - stub_penalty)

A single detected stub docks the depth score by one point; two or more docks it by two. The cap prevents catastrophic scoring on documents with many legitimate long method names. This heuristic catches model-invented API calls like client.generate_contextual_embedding_chain() but not short fabricated names.
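
The whitelist mentioned above could be layered onto the same regex by capturing the object name as well as the method name. The sketch below assumes a hypothetical `TRUSTED_OBJECTS` set and a renamed function `count_suspect_calls`; neither comes from the harness itself.

```python
import re

# Hypothetical whitelist: objects whose long method names are trusted.
TRUSTED_OBJECTS = {"response", "subprocess", "json"}

# Group 1 is the object name, group 2 the 12+ character method name.
STUB_PATTERN = re.compile(r"\b(\w+)\.([a-z_]{12,})\s*\(")


def count_suspect_calls(output: str, cap: int = 2) -> int:
    """Count long method calls on non-whitelisted objects, capped at cap."""
    suspects = [
        method
        for obj, method in STUB_PATTERN.findall(output)
        if obj not in TRUSTED_OBJECTS
    ]
    return min(len(suspects), cap)


print(count_suspect_calls("client.generate_contextual_embedding_chain()"))  # 1
print(count_suspect_calls("response.raise_for_status()"))  # 0, whitelisted object
print(count_suspect_calls("client.fake_call()"))  # 0, name shorter than 12
```

The second call shows why the whitelist matters: without it, a real method such as `raise_for_status` would be counted as a stub purely because of its length.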

Cycling Detection

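If all six dimension scores in the current round are identical to those in the previous round, either the producer has stopped making substantive changes or the evaluator is not registering them. In both cases further rounds are unlikely to move the score: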

if round_num > 1:
    prev_dims = scores_by_round[-2]["dims"]
    if all(eval_result["dimensions"][k] == prev_dims[k]
           for k in eval_result["dimensions"] if k != "issues"):
        log("[wiggum] cycling detected — terminating early")
        return _best_content, scores_by_round, "FAIL"

Cycling happens when revisions are cosmetic — rephrasing a placeholder rather than replacing it with real implementation. Continuing additional rounds does not help; the check terminates early and saves the cost of the remaining rounds.
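
The same comparison can be pulled out into a standalone function, which makes it easy to test in isolation. A minimal sketch, assuming each round's dimensions arrive as a dict that may also carry an "issues" list; the name `is_cycling` and the guard against empty dicts are additions for illustration.

```python
def is_cycling(current: dict, previous: dict) -> bool:
    """True when every numeric dimension matches the previous round."""
    keys = [k for k in current if k != "issues"]
    # Guard: all() over an empty list is True, which would be a false alarm.
    return bool(keys) and all(current[k] == previous.get(k) for k in keys)


r1 = {"depth": 6, "relevance": 8, "issues": ["placeholder in step 2"]}
r2 = {"depth": 6, "relevance": 8, "issues": ["step 2 still lacks real code"]}
r3 = {"depth": 7, "relevance": 8, "issues": []}

print(is_cycling(r2, r1))  # True: scores unchanged, only issue wording differs
print(is_cycling(r3, r2))  # False: depth moved from 6 to 7
```

Note that the check compares scores, not text, so a revision that changes the document without changing any score is still treated as cycling.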

Best-Round Restoration

Revision rounds occasionally degrade the output: the producer addresses the cited issues but introduces new ones. Without tracking, a run that peaks at 8.7 in round 2 and regresses to 8.3 in round 3 would store the 8.3 version. The loop tracks _best_content and _best_score across all rounds and returns the best content seen regardless of when it occurred — including on FAIL outcomes, where the best intermediate output reaches memory storage rather than the final (potentially worse) revision.
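
The 8.7 then 8.3 scenario above can be traced with a few lines. This sketch mirrors the strict `>` comparison from the loop; the function name `best_round` and the round 1 score of 8.1 are assumptions made for the example.

```python
def best_round(rounds):
    """Return (content, score) for the highest-scoring round.

    rounds is a list of (content, score) pairs in round order. Ties keep
    the earliest round because the comparison is strict.
    """
    best_content, best_score = None, -1.0
    for content, score in rounds:
        if score > best_score:
            best_content, best_score = content, score
    return best_content, best_score


history = [("draft v1", 8.1), ("draft v2", 8.7), ("draft v3", 8.3)]
print(best_round(history))  # ('draft v2', 8.7)
```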

The wiggum_r1 Metric

The composite score formula (covered in the next reading) uses the mean first-pass score, not the final score, as its dominant term:

composite = 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10

This is deliberate. A harness that scores 9.0 on round 1 is better than one that scores 7.0 on round 1 and 9.0 on round 2 — even though both end at 9.0. The second harness required a revision round, which costs tokens and latency. The metric rewards getting it right the first time, which is the correct optimization target for a production system.

The implication for harness design: the right place to invest engineering effort is in the research and synthesis stages (improving what goes into the model), not in adding more revision rounds (compensating for weak output after the fact).
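
To make the 9.0 versus 7.0 comparison concrete, the formula can be evaluated for both harnesses. The `criteria_rate` of 0.8 is an assumed value, taken to be a fraction between 0 and 1 and held equal for both so that only the first-pass score differs.

```python
def composite(mean_wiggum_r1: float, criteria_rate: float) -> float:
    """Composite score per the formula above, on a 0-10 scale."""
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


# Assumed: both harnesses satisfy 80% of criteria and both end at 9.0.
first_try = composite(mean_wiggum_r1=9.0, criteria_rate=0.8)
needs_revision = composite(mean_wiggum_r1=7.0, criteria_rate=0.8)
print(f"{first_try:.2f} vs {needs_revision:.2f}")  # 8.70 vs 7.30
```

The final score of 9.0 never enters the calculation, so the revision round that rescued the second harness earns it nothing in the composite.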

Logging Scores per Round

All per-round scores are written to runs.jsonl:

"wiggum_scores": {
  "r1": {"depth": 7, "relevance": 8, "completeness": 7, "grounded": 6, "specificity": 6, "structure": 9},
  "r2": {"depth": 8, "relevance": 9, "completeness": 8, "grounded": 8, "specificity": 8, "structure": 9}
},
"wiggum_rounds": 2,
"final_score": 8.55,
"final": "PASS"

This per-dimension, per-round breakdown is what enables the failure taxonomy analysis in the final reading of this module: by aggregating these scores across hundreds of runs, you can identify which dimensions fail most often and at what score level.
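
A first pass at that aggregation might average each dimension's round 1 score across runs and report the weakest one. This is a sketch, assuming every line of runs.jsonl is a JSON object with a "wiggum_scores" key shaped like the record above; the function name `mean_r1_by_dimension` is hypothetical.

```python
import json
from collections import defaultdict


def mean_r1_by_dimension(jsonl_lines):
    """Average each dimension's round-1 score across run records."""
    totals, counts = defaultdict(float), defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        r1 = record.get("wiggum_scores", {}).get("r1", {})
        for dim, score in r1.items():
            totals[dim] += score
            counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in totals}


lines = [
    '{"wiggum_scores": {"r1": {"depth": 7, "grounded": 6}}}',
    '{"wiggum_scores": {"r1": {"depth": 5, "grounded": 8}}}',
]
means = mean_r1_by_dimension(lines)
weakest = min(means, key=means.get)
print(weakest, means[weakest])  # depth 6.0
```

Because the function accepts any iterable of lines, an open file handle on runs.jsonl can be passed in directly.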
