Harness Engineering for AI Agents · Verification & Failure Modes

The Evaluation Rubric


By the end of this reading you will be able to:

Compute a weighted Wiggum score from raw dimension scores across all six dimensions and derive the composite evaluation score
Justify the relative weights of the six rubric dimensions by explaining what each captures, why depth is still weighted most heavily, and what motivated adding the grounded dimension
Distinguish task-type-specific evaluation criteria for enumerated, best_practices, and research tasks and explain why a single rubric cannot cover all three

The Six Dimensions

The Wiggum rubric scores output across six dimensions, each with a specific weight:

Dimension	Weight	What it measures
Depth	0.25	Are concrete implementation steps, examples, and specifics provided?
Relevance	0.20	Does the output address the task? Is content on-topic?
Completeness	0.20	Are all required items or aspects present?
Grounded	0.15	Are claims supported by the research context? Does the output avoid asserting facts not in the source material?
Specificity	0.10	Are named tools, versions, commands, and numbers used (not vague generics)?
Structure	0.10	Is the output organized with appropriate headings and readable flow?

The weighted score:

weighted = (
    dims["depth"]        * 0.25 +
    dims["relevance"]    * 0.20 +
    dims["completeness"] * 0.20 +
    dims["grounded"]     * 0.15 +
    dims["specificity"]  * 0.10 +
    dims["structure"]    * 0.10
)
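As a worked example of the formula above, the sketch below scores a single output. The sample dimension scores are invented for illustration, and the `WEIGHTS` dict and `weighted_score` function are names introduced here, not taken from the harness.

```python
import math

# Weights copied from the rubric table above.
WEIGHTS = {
    "depth": 0.25,
    "relevance": 0.20,
    "completeness": 0.20,
    "grounded": 0.15,
    "specificity": 0.10,
    "structure": 0.10,
}


def weighted_score(dims: dict) -> float:
    """Return the weighted Wiggum score for one set of 0-10 dimension scores."""
    missing = WEIGHTS.keys() - dims.keys()
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(dims[name] * weight for name, weight in WEIGHTS.items())


# The weights sum to 1, so the result stays on the same 0-10 scale as the inputs.
assert math.isclose(sum(WEIGHTS.values()), 1.0)

# Hypothetical scores for one output (illustrative values only).
sample = {
    "depth": 7.0,
    "relevance": 9.0,
    "completeness": 8.0,
    "grounded": 7.0,
    "specificity": 6.0,
    "structure": 9.0,
}
print(round(weighted_score(sample), 2))
```

For these sample values the sum works out to 1.75 + 1.80 + 1.60 + 1.05 + 0.60 + 0.90 = 7.70.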

Weight Rationale

Depth (0.25) remains the highest-weighted dimension because it is the hardest to achieve and the most valuable in a knowledge synthesis context. A high-depth output includes working code examples, concrete implementation steps, specific configuration values, and real-world usage scenarios. A low-depth output describes concepts correctly but stops short of implementation — the reader knows what to do but not how to do it. Depth is the dimension where models most consistently fail, accounting for over 80% of the top failure clusters in the taxonomy.

Relevance (0.20) and Completeness (0.20) share second weight. Relevance penalizes drift — models rarely go completely off-topic but do spend paragraphs on loosely related concepts. Completeness penalizes missing items, which are easy to overlook in a long document but hard to compensate for after the fact.

Grounded (0.15) was added after production runs revealed a failure mode distinct from specificity. When synthesizing across multiple sources, models would occasionally make confident-sounding claims that appeared in none of the retrieved documents: plausible extrapolations presented as reported findings. Grounded is not a factual accuracy check, because the evaluator cannot verify ground truth. It is a source-fidelity check: does the output represent what the research actually said? A specific claim like "Redis achieves 1M ops/sec in this configuration" should be traceable to a source, whereas a general statement like "caching reduces latency" is grounded even without citation. When this dimension was added, depth's weight was reduced from 0.30 to 0.25, so the two dimensions now address distinct failure modes instead of depth trying to cover both implementation quality and citation integrity.
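In the rubric, grounded is scored by the evaluator, not by code. Purely to illustrate what "traceable to a source" means, the sketch below flags numeric figures in an output that appear nowhere in the retrieved sources. This is a crude heuristic introduced here, not the harness's method: the regex, the `ungrounded_figures` name, and the sample texts are all assumptions, and plain substring matching will miss paraphrased figures.

```python
import re

# Matches standalone figures such as "1M", "24h", "80%", or "3.5".
# Digits embedded in identifiers (for example "v2") are skipped by the word boundary.
NUMERIC_FIGURE = re.compile(r"\b\d+(?:\.\d+)?(?:%|[A-Za-z]+)?")


def ungrounded_figures(output: str, sources: list) -> list:
    """Return figures from the output that do not occur in any source text."""
    corpus = " ".join(sources).lower()
    figures = NUMERIC_FIGURE.findall(output)
    return [figure for figure in figures if figure.lower() not in corpus]


# Hypothetical retrieved sources and model output (illustrative only).
sources = [
    "Redis is commonly used as a cache to reduce latency.",
    "A 24h TTL is a common default for search results.",
]
output = "Use Redis with a 24h TTL. Redis achieves 1M ops/sec in this configuration."

print(ungrounded_figures(output, sources))
```

Here the "24h" figure is found in the sources and passes, while "1M" is flagged because no source mentions it. A general statement with no figures, such as "caching reduces latency", produces no flags.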

Specificity (0.10) is closely related to depth but distinct: depth asks whether implementation steps exist; specificity asks whether those steps use real names. "Use a caching library" is an implementation step, so it counts toward depth, but it names nothing and so is not specific. "Use Redis with a 24h TTL in your DDGS search wrapper" is both. Specificity's weight is lower now that grounded handles the related concern about unsubstantiated assertions.

Structure (0.10) has the lowest weight because it is the easiest dimension to satisfy and the least differentiating. Most model output is adequately structured; the rubric penalizes truly disorganized output but does not reward elaborate formatting over substance.

The Composite Score

The composite score that drives the autoresearch keep rule and experiment comparisons is not the Wiggum weighted score alone:

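def compute_composite(runs: list[dict]) -> float:
    """Blend mean first-pass Wiggum score (70%) with hard-criteria pass rate (30%)."""
    if not runs:
        raise ValueError("runs must contain at least one run")

    # Average first-pass (r1) weighted score, on a 0-10 scale.
    mean_wiggum_r1 = sum(r["wiggum_scores"]["r1"]["weighted"] for r in runs) / len(runs)
    # Fraction of runs whose hard criteria all passed, between 0 and 1.
    criteria_pass  = sum(1 for r in runs if r["criteria_check"] == "PASS")
    criteria_rate  = criteria_pass / len(runs)

    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10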

Two components:

mean_wiggum_r1 (70%): the average first-pass Wiggum score across all runs in the session. This rewards consistently high first-pass quality.

criteria_rate * 10 (30%): the fraction of runs that pass hard criteria checks, scaled to 0–10. Hard criteria are Python-verified facts: the file was written, the output has the correct item count, the output meets a minimum line count. These are binary pass/fail, not rubric scores.

The 70/30 split limits how far either component can be gamed. Padding output to pass criteria while producing shallow content earns at most the 3.0 points that criteria contribute, and deep content that is never saved to a file forfeits those 3.0 points.
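To make the effect of the split concrete, the sketch below runs the composite formula over three hypothetical sessions. The run records contain only the two fields the formula reads, the scores are invented, and the "FAIL" label is an assumption (the source only shows "PASS").

```python
def compute_composite(runs: list) -> float:
    # Same formula as compute_composite above, repeated so this example runs on its own.
    mean_wiggum_r1 = sum(r["wiggum_scores"]["r1"]["weighted"] for r in runs) / len(runs)
    criteria_rate = sum(1 for r in runs if r["criteria_check"] == "PASS") / len(runs)
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


def make_run(weighted: float, criteria_check: str) -> dict:
    """Build a minimal run record with only the fields compute_composite reads."""
    return {"wiggum_scores": {"r1": {"weighted": weighted}}, "criteria_check": criteria_check}


# Padded but shallow: every run passes hard criteria, rubric score is low.
padded = [make_run(4.0, "PASS") for _ in range(4)]
# Deep but unsaved: rubric score is high, every run fails hard criteria.
unsaved = [make_run(9.0, "FAIL") for _ in range(4)]
# Solid on both counts.
solid = [make_run(8.0, "PASS") for _ in range(4)]

for name, runs in [("padded", padded), ("unsaved", unsaved), ("solid", solid)]:
    print(f"{name}: {compute_composite(runs):.2f}")
```

With these values the padded session scores 0.7 × 4.0 + 3.0 = 5.80, the unsaved session scores 0.7 × 9.0 + 0 = 6.30, and the solid session scores 0.7 × 8.0 + 3.0 = 8.60, so neither shortcut reaches the score of a session that does both well.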

Hard Criteria

For each eval task, hard criteria are defined in eval_suite.py:

import os

# count_lines and count_h2 are helper functions not shown in this excerpt.
EVAL_TASKS = [
    {
        "id": "T_A",
        "task": "Search for the top 5 context engineering techniques...",
        "task_type": "enumerated",
        "criteria": [
            lambda output, path: os.path.exists(path),        # file saved
            lambda output, path: count_lines(output) >= 15,   # min line count
            lambda output, path: count_h2(output) == 5,       # exactly 5 sections
        ]
    },
    {
        "id": "T_B",
        "task": "Search for best practices for cost envelope management...",
        "task_type": "best_practices",
        "criteria": [
            lambda output, path: os.path.exists(path),
            lambda output, path: count_lines(output) >= 15,
        ]
    },
    # ...
]

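Criteria are pure Python functions with no model involvement. They are fast and deterministic, and persuasive wording in the output cannot change their result: the file exists or it does not, and the section count matches or it does not.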
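The helpers `count_lines` and `count_h2` are not shown in this reading, so the sketch below supplies hypothetical versions and a small runner that turns a criteria list into the PASS or FAIL label used by the composite score. It assumes Markdown output where each top-level section is a `## ` heading and that blank lines do not count toward the line total. The `check_criteria` name and the choice to treat a raised exception as a failure are also assumptions.

```python
import os
import tempfile


def count_lines(output: str) -> int:
    """Count non-blank lines (assumption: blank lines do not count)."""
    return sum(1 for line in output.splitlines() if line.strip())


def count_h2(output: str) -> int:
    """Count Markdown level-2 headings, i.e. lines starting with '## '."""
    return sum(1 for line in output.splitlines() if line.startswith("## "))


def check_criteria(criteria, output: str, path: str) -> str:
    """Return 'PASS' only if every criterion holds; a raising criterion counts as a failure."""
    for criterion in criteria:
        try:
            if not criterion(output, path):
                return "FAIL"
        except Exception:
            return "FAIL"
    return "PASS"


# Criteria in the same shape as the T_A entry above.
criteria = [
    lambda output, path: os.path.exists(path),        # file saved
    lambda output, path: count_lines(output) >= 15,   # min line count
    lambda output, path: count_h2(output) == 5,       # exactly 5 sections
]

# Build a small five-section document and save it to a temporary file.
sections = [f"## Technique {i}\nWhat it is.\nHow to apply it." for i in range(1, 6)]
good_output = "\n\n".join(sections)
with tempfile.TemporaryDirectory() as tmp:
    saved_path = os.path.join(tmp, "output.md")
    with open(saved_path, "w", encoding="utf-8") as handle:
        handle.write(good_output)
    result_good = check_criteria(criteria, good_output, saved_path)
    # Dropping one section breaks the line-count and section-count criteria.
    result_short = check_criteria(criteria, "\n\n".join(sections[:4]), saved_path)
    # A path that was never written breaks the file-saved criterion.
    result_missing = check_criteria(criteria, good_output, os.path.join(tmp, "missing.md"))

print(result_good, result_short, result_missing)
```

Because every check is a plain function of the output text and the file path, rerunning it on the same inputs always gives the same label.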

Task-Type-Specific Evaluation

The base rubric is the same for all task types, but the evaluator prompt adds task-type-specific guidance:

Enumerated tasks add: "Verify that exactly N top-level sections are present (where N is the specified count). Mark completeness below 7 if the count is wrong."

Best-practices tasks add: "For each recommendation, check that a concrete implementation note is present — not just a description of what to do, but specific steps, commands, or code for how to do it."

Research tasks add: "Check that multiple distinct sources are integrated and that the output acknowledges tensions or gaps between them rather than presenting a single unified view."

Without these task-type additions, the evaluator applies a generic rubric that treats a 5-item enumeration the same as a best-practices survey — missing the count failure in one case and the implementation gap in the other.
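One way to picture the mechanism is a lookup from `task_type` to the extra guidance, appended to a shared base prompt. The sketch below is an assumption about how that could be wired up: the `BASE_RUBRIC` placeholder, the `build_evaluator_prompt` function, and the `item_count` parameter are hypothetical, and the guidance strings are lightly adapted from the quotes above so that N can be filled in.

```python
from typing import Optional

# Placeholder for the shared rubric text; the real evaluator prompt is not shown in this reading.
BASE_RUBRIC = "Score the output from 0 to 10 on each of the six rubric dimensions."

# Task-type guidance, keyed by the task_type values used in EVAL_TASKS.
TASK_TYPE_GUIDANCE = {
    "enumerated": (
        "Verify that exactly {n} top-level sections are present. "
        "Mark completeness below 7 if the count is wrong."
    ),
    "best_practices": (
        "For each recommendation, check that a concrete implementation note is present: "
        "not just a description of what to do, but specific steps, commands, or code "
        "for how to do it."
    ),
    "research": (
        "Check that multiple distinct sources are integrated and that the output "
        "acknowledges tensions or gaps between them rather than presenting a single "
        "unified view."
    ),
}


def build_evaluator_prompt(task_type: str, item_count: Optional[int] = None) -> str:
    """Append the guidance for one task type to the shared base rubric."""
    if task_type not in TASK_TYPE_GUIDANCE:
        raise ValueError(f"unknown task_type: {task_type!r}")
    guidance = TASK_TYPE_GUIDANCE[task_type]
    if task_type == "enumerated":
        if item_count is None:
            raise ValueError("enumerated tasks need an item_count")
        guidance = guidance.format(n=item_count)
    return f"{BASE_RUBRIC}\n\n{guidance}"


print(build_evaluator_prompt("enumerated", item_count=5))
```

Raising on an unknown task type, instead of silently falling back to the generic rubric, keeps the count and implementation checks from being skipped by accident.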

Interpreting Dimension Scores

Across 1,500 logged runs, two dimensions are consistently the weakest:

Depth averages 7.1/10 across all task types — the most common failure is implementation notes that describe what a technique does without showing how to apply it
Specificity averages 6.9/10 — the most common failure is generic tool references ("use a vector database") instead of specific ones ("use ChromaDB with all-MiniLM-L6-v2 embeddings")

Grounded averages 7.4/10 — failures cluster around synthesis tasks where the model extrapolates a confident specific claim (a latency figure, a benchmark number) from general context that only implies it. Structure averages 8.9/10 and Relevance 8.7/10 — both rarely problematic. Completeness is bimodal: either high (8.5+) or very low (5.0–6.0 when count constraints are missed).
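Averages like these can be computed from logged per-dimension scores with a few lines of Python. The sketch below uses three invented runs, not the logged data, and the function names are introduced here. It also counts runs with completeness below 7, since a mean alone hides a bimodal distribution.

```python
from statistics import mean


def dimension_means(all_dims: list) -> dict:
    """Average each dimension across a list of per-run dimension score dicts."""
    if not all_dims:
        raise ValueError("need at least one run")
    return {name: mean(d[name] for d in all_dims) for name in all_dims[0]}


def weakest_dimensions(all_dims: list, k: int = 2) -> list:
    """Return the k dimension names with the lowest mean score, weakest first."""
    means = dimension_means(all_dims)
    return sorted(means, key=means.get)[:k]


def low_completeness_count(all_dims: list, threshold: float = 7.0) -> int:
    """Count runs whose completeness falls below the threshold."""
    return sum(1 for d in all_dims if d["completeness"] < threshold)


# Three hypothetical runs (illustrative values only).
sample_runs = [
    {"depth": 7.0, "relevance": 9.0, "completeness": 9.0,
     "grounded": 7.5, "specificity": 6.5, "structure": 9.0},
    {"depth": 7.5, "relevance": 8.5, "completeness": 5.5,
     "grounded": 7.0, "specificity": 7.0, "structure": 9.0},
    {"depth": 6.5, "relevance": 8.5, "completeness": 8.5,
     "grounded": 8.0, "specificity": 7.0, "structure": 8.5},
]

print(weakest_dimensions(sample_runs))
print(low_completeness_count(sample_runs))
```

In this sample, specificity and depth come out weakest on average, while completeness looks healthy on its mean even though one of the three runs missed its count constraint.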
