Harness Engineering for AI Agents · Verification & Failure Modes

Why External Verification

10 min read

By the end of this reading you will be able to:

Identify three categories of model self-report that cannot be trusted and explain the harness-side check that replaces each
Explain why model-generated evaluation of model-generated output is unreliable and how external evaluation breaks the circularity
Implement a harness-side count check that verifies enumerated output meets its count constraint without relying on the model's self-report

The Trust Problem

The second harness engineering design principle is: verify externally at every stage boundary. This principle exists because language models are confidently wrong in ways that are invisible without external checking.

Three specific failure patterns make this concrete:

File existence. The model says: "I have saved the output to ~/Desktop/output.md." The harness does not take this on trust; it checks the path specified in the task with os.path.exists(). In early harness runs, the model wrote output to a path it hallucinated rather than the path specified in the task. The model reported success, but the expected file did not exist. Without the check, the run was logged as PASS on a non-existent output.

Token counts. The model says: "The output is approximately 500 tokens." The harness ignores the estimate and reads the total_tokens field from the response's usage metadata. Models systematically underestimate their own token consumption by 30–50%, not from deception but from imprecision in self-estimation. Token accounting based on model self-report therefore produces systematically wrong cost calculations.
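The substitution of measured tokens for estimated tokens can be sketched as follows. This is a minimal illustration, assuming the API response has already been parsed into a dict with a usage object that contains total_tokens (the field named above). The helper names, the sample response, and the price argument are hypothetical and are not taken from the harness itself.

```python
from typing import Any, Mapping


def tokens_from_usage(response: Mapping[str, Any]) -> int:
    """Return the provider-reported token total, never the model's own estimate."""
    usage = response.get("usage")
    if not usage or "total_tokens" not in usage:
        # Fail loudly rather than fall back to a self-reported number.
        raise ValueError("response has no usage.total_tokens field")
    return int(usage["total_tokens"])


def run_cost(response: Mapping[str, Any], price_per_1k_tokens: float) -> float:
    """Cost of one call from measured tokens (the price is a hypothetical input)."""
    return tokens_from_usage(response) / 1000 * price_per_1k_tokens


# Hypothetical parsed response: the model's text claims about 500 tokens,
# while the usage metadata records 730.
sample_response = {
    "text": "The output is approximately 500 tokens.",
    "usage": {"prompt_tokens": 210, "completion_tokens": 520, "total_tokens": 730},
}
measured = tokens_from_usage(sample_response)
```

Raising on a missing usage field is a deliberate choice in this sketch: a silent fallback to the model's estimate would reintroduce the error the check is meant to remove.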

Item counts. The model says: "Here are the top 5 context engineering techniques" and then produces 4, or 6, or 3. It is not lying — it lost count. The harness counts the actual items in the output using Python:

def count_top_level_items(markdown: str) -> int:
    """Count '## ' headings, each of which marks one item in enumerated output."""
    return sum(1 for line in markdown.splitlines() if line.startswith('## '))

def count_check(output: str, expected_count: int) -> bool:
    """Return True only if the output contains exactly expected_count items."""
    return count_top_level_items(output) == expected_count

If the count is wrong, the harness requests a revision — before the output ever reaches the Wiggum evaluator.
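One limitation of a line-prefix count is that a line starting with "## " inside a fenced code block is counted as an item too. The variant below is a sketch of one way to skip fenced regions, assuming fences are marked by lines that begin with three backticks. The name count_items_outside_fences is hypothetical and is not part of the harness described here.

```python
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def count_items_outside_fences(markdown: str) -> int:
    """Count '## ' headings, ignoring any that appear inside fenced code."""
    in_fence = False
    count = 0
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if not in_fence and line.startswith("## "):
            count += 1
    return count


# Sample output with two real items and one heading-like line inside a fence.
sample = "\n".join([
    "## First technique",
    "Some prose.",
    FENCE + "markdown",
    "## Not an item, just example text",
    FENCE,
    "## Second technique",
])
items = count_items_outside_fences(sample)
```

Whether this stricter rule is worth the extra code depends on how often the producer emits markdown examples inside its own output.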

Categories of Untrusted Self-Report

The failure patterns above belong to three general categories:

Category	Model claim	External check
Behavioral	"I wrote the file"	os.path.exists(path)
Quantitative	"The output has N items / K tokens"	Harness-side item count in Python / usage metadata on the response
Qualitative	"This output is high quality"	Independent evaluator model

The third category — qualitative self-assessment — is the hardest to handle with a Python check. This is where the Wiggum loop comes in: a different model evaluates the output against explicit criteria.

Why Qualitative Self-Evaluation Fails

A model evaluating its own output reproduces the same systematic tendencies it used to produce it:

If the model tends to overuse hedging language, it will score hedging language positively
If the model was trained on data that rewards confident-sounding claims, it will score confidence positively even when specificity is lacking
If the model produces placeholder implementations ("here you would call your API"), it will score those placeholders as meeting the depth criterion because it does not perceive the gap between a placeholder and actual implementation

This is not a failure of intelligence — it is a structural property of the same weights producing both output and evaluation. External evaluation with a different model (different weights, different training) breaks this circularity.

Experiment 3 quantified the effect: upgrading the evaluator from glm4:9b to Qwen3-Coder:30b (a model from a different family, 3x larger) improved mean composite score by ~1.2 points — not because the producer changed, but because the evaluator caught failures the smaller model missed.
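The separation between producer and evaluator described in this section can be enforced by the harness rather than left to convention. The sketch below assumes model calls go through a caller-supplied function. The names evaluate_externally and call_model, the prompt wording, and the model names are all hypothetical; the text does not describe how the Wiggum evaluator is actually invoked.

```python
from typing import Callable, Sequence


def evaluate_externally(
    output: str,
    criteria: Sequence[str],
    producer_model: str,
    evaluator_model: str,
    call_model: Callable[[str, str], str],
) -> str:
    """Ask a different model to judge the output against explicit criteria."""
    if evaluator_model == producer_model:
        # The same weights would reproduce the producer's own blind spots.
        raise ValueError("evaluator must differ from the producer model")
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        "Evaluate the following output against each criterion.\n"
        f"Criteria:\n{criteria_text}\n\nOutput:\n{output}"
    )
    return call_model(evaluator_model, prompt)


# Stand-in for a real model call so the sketch runs without a backend.
def fake_call_model(model: str, prompt: str) -> str:
    return f"[{model}] reviewed {len(prompt)} characters"


verdict = evaluate_externally(
    "## Item one\n...",
    ["specific claims", "no placeholder implementations"],
    producer_model="producer-model",
    evaluator_model="evaluator-model",
    call_model=fake_call_model,
)
```

A name comparison only guards against the most direct form of circularity. It does not establish that two differently named models come from different families or training runs, which is the property the argument above relies on.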

The Count Check in Practice

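def synthesize_with_count_check(task, context, plan, producer_model):
    # call_producer and log are harness helpers defined elsewhere; producer_model
    # is not referenced in this excerpt.
    output = call_producer(task, context, plan)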

    if plan.task_type == "enumerated" and plan.count_constraint:
        actual = count_top_level_items(output)
        expected = plan.count_constraint

        if actual != expected:
            log(f"[count check] expected {expected}, got {actual} — retrying")
            retry_prompt = (
                f"Your output has {actual} items. The task requires exactly {expected}. "
                f"Please revise to have exactly {expected} top-level sections."
            )
            output = call_producer(task + "\n\n" + retry_prompt, context, plan)

    return output

The retry is a single attempt, not a loop. If the second attempt also gets the count wrong, the output proceeds to Wiggum, which will flag it as a completeness failure. The count check eliminates the most obvious structural errors before the more expensive evaluation step.
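The single-retry bound can also be made explicit, with the result of the count check recorded so that a later stage can see it. The following is a sketch under stated assumptions: the producer is passed in as a plain callable, and CountedOutput, count_headings, and produce_with_bounded_retry are hypothetical names rather than parts of the harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CountedOutput:
    text: str
    expected: int
    actual: int
    attempts: int

    @property
    def count_ok(self) -> bool:
        return self.actual == self.expected


def count_headings(markdown: str) -> int:
    # Same rule as the count check above: one '## ' heading per item.
    return sum(1 for line in markdown.splitlines() if line.startswith("## "))


def produce_with_bounded_retry(
    produce: Callable[[str], str], task: str, expected: int, max_retries: int = 1
) -> CountedOutput:
    """Call the producer, retrying at most max_retries times on a count mismatch."""
    prompt = task
    for attempt in range(1, max_retries + 2):
        text = produce(prompt)
        actual = count_headings(text)
        if actual == expected or attempt > max_retries:
            return CountedOutput(text, expected, actual, attempt)
        prompt = (
            f"{task}\n\nYour output has {actual} items. "
            f"The task requires exactly {expected}."
        )
    raise AssertionError("unreachable for max_retries >= 0")


# Stand-in producer: returns 4 items first, then 5 once the retry prompt arrives.
def fake_producer(prompt: str) -> str:
    n = 5 if "requires exactly" in prompt else 4
    return "\n".join(f"## Item {i}" for i in range(1, n + 1))


result = produce_with_bounded_retry(fake_producer, "List the top 5 techniques", 5)
```

Returning the actual count alongside the text means the evaluator stage does not need to recount to know that a retry happened or that the constraint is still unmet.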

Checking Behavioral Claims

import os

# log is a harness helper defined elsewhere.
def verify_output_written(output_path: str, min_size_bytes: int = 100) -> bool:
    if not os.path.exists(output_path):
        log(f"[verify] file not found: {output_path}")
        return False
    size = os.path.getsize(output_path)
    if size < min_size_bytes:
        log(f"[verify] file too small: {size} bytes")
        return False
    return True

File existence and minimum size are checked after every synthesis stage that is supposed to produce a file. A file smaller than the 100-byte default is almost certainly empty or a stub, so it is caught and flagged before the run is logged as PASS.
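One detail matters for paths written like ~/Desktop/output.md: os.path.exists() does not expand "~", so a harness has to normalize the path before checking it. The sketch below also checks the path taken from the task specification rather than any path the model reports. The function names resolve_output_path and verify_at_specified_path are hypothetical, and the demonstration uses a temporary directory in place of a real output location.

```python
import os
import tempfile


def resolve_output_path(path: str) -> str:
    """Expand '~' (os.path.exists does not) and return an absolute path."""
    return os.path.abspath(os.path.expanduser(path))


def verify_at_specified_path(specified_path: str, min_size_bytes: int = 100) -> bool:
    """Check the path from the task spec, ignoring whatever path the model reported."""
    resolved = resolve_output_path(specified_path)
    return os.path.isfile(resolved) and os.path.getsize(resolved) >= min_size_bytes


# Demonstration with a temporary directory instead of a real Desktop folder.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "output.md")
    with open(good, "w", encoding="utf-8") as f:
        f.write("## Item\n" + "x" * 200)
    found = verify_at_specified_path(good)
    missing = verify_at_specified_path(os.path.join(tmp, "hallucinated.md"))
```

Using os.path.isfile() instead of os.path.exists() is a small tightening in this sketch: a directory at the expected path would otherwise pass the existence check.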
