Harness Engineering for AI Agents · Verification & Failure Modes

Why External Verification

By the end of this reading you will be able to:
  • Identify three categories of model self-report that cannot be trusted and explain the harness-side check that replaces each
  • Explain why model-generated evaluation of model-generated output is unreliable and how external evaluation breaks the circularity
  • Implement a harness-side count check that verifies enumerated output meets its count constraint without relying on the model's self-report

The Trust Problem

The second harness engineering design principle is: verify externally at every stage boundary. This principle exists because language models are confidently wrong in ways that are invisible without external checking.

Three specific failure patterns make this concrete:

File existence. The model says: "I have saved the output to ~/Desktop/output.md." The harness does not take the claim at face value; it checks the path with os.path.exists(). In early harness runs, the model wrote output to a path it hallucinated rather than the path specified in the task. The model reported success, but the file did not exist. Without the check, the run would have logged as PASS on a non-existent output.

Token counts. The model says: "The output is approximately 500 tokens." The harness ignores the estimate and reads the total_tokens field from the response's usage metadata. Models systematically underestimate their own token consumption by 30–50%, not from deception but from imprecise self-estimation, so token accounting based on model self-report produces systematically wrong cost calculations.
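A minimal sketch of that check follows. It assumes an OpenAI-style response shape in which a usage object carries prompt_tokens, completion_tokens and total_tokens; other providers name these fields differently. The names TokenUsage and read_usage are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


def read_usage(response: dict) -> TokenUsage:
    """Extract token counts from the response's usage metadata.

    Assumes response["usage"] holds integer prompt_tokens,
    completion_tokens and total_tokens fields (OpenAI-style shape).
    Raises KeyError if the metadata is missing, so a run cannot
    silently fall back to the model's own estimate.
    """
    usage = response["usage"]
    return TokenUsage(
        prompt_tokens=int(usage["prompt_tokens"]),
        completion_tokens=int(usage["completion_tokens"]),
        total_tokens=int(usage["total_tokens"]),
    )


# Hypothetical response, for illustration only.
example_response = {
    "usage": {"prompt_tokens": 1200, "completion_tokens": 740, "total_tokens": 1940}
}
usage = read_usage(example_response)
```

Failing loudly on missing metadata is a deliberate choice in this sketch: a cost calculation that quietly uses the model's estimate would reintroduce the problem the check is meant to remove.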

Item counts. The model says: "Here are the top 5 context engineering techniques" and then produces 4, or 6, or 3. It is not lying — it lost count. The harness counts the actual items in the output using Python:

def count_top_level_items(markdown: str) -> int:
    # Count ## headings as items in enumerated output
    return len([line for line in markdown.split('\n')
                if line.startswith('## ')])

def count_check(output: str, expected_count: int) -> bool:
    actual = count_top_level_items(output)
    return actual == expected_count

If the count is wrong, the harness requests a revision — before the output ever reaches the Wiggum evaluator.
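One limitation of counting every line that starts with "## " is that a heading-like line inside a fenced code block is counted as an item. The sketch below is one possible refinement, not the harness's own implementation: the hypothetical count_headings_outside_fences skips lines between triple-backtick fences. It assumes fences are always opened and closed with three backticks at the start of a line.

```python
FENCE = "`" * 3  # triple backtick, built indirectly so this example stays readable


def count_headings_outside_fences(markdown: str, prefix: str = "## ") -> int:
    """Count lines starting with `prefix`, ignoring fenced code blocks."""
    in_fence = False
    count = 0
    for line in markdown.splitlines():
        # A fence line toggles the "inside code block" state and is never an item.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if not in_fence and line.startswith(prefix):
            count += 1
    return count


sample = "\n".join([
    "## One",
    "Some text",
    FENCE + "python",
    "## not a heading, just a line inside code",
    FENCE,
    "## Two",
])
items = count_headings_outside_fences(sample)
```

On this sample the fence-aware counter sees two items, while a plain startswith("## ") count would see three.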

Categories of Untrusted Self-Report

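Untrusted self-report falls into three general categories. The failure patterns above cover the first two (file existence is behavioral; token and item counts are quantitative), and the third is qualitative: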

Category      Model claim                            External check
Behavioral    "I wrote the file"                     os.path.exists(path)
Quantitative  "The output has N items / K tokens"    Python count / response.usage
Qualitative   "This output is high quality"          Independent evaluator model

The third category — qualitative self-assessment — is the hardest to handle with a Python check. This is where the Wiggum loop comes in: a different model evaluates the output against explicit criteria.

Why Qualitative Self-Evaluation Fails

A model evaluating its own output reproduces the same systematic tendencies it used to produce it:

  • If the model tends to overuse hedging language, it will score hedging language positively
  • If the model was trained on data that rewards confident-sounding claims, it will score confidence positively even when specificity is lacking
  • If the model produces placeholder implementations ("here you would call your API"), it will score those placeholders as meeting the depth criterion because it does not perceive the gap between a placeholder and actual implementation

This is not a failure of intelligence — it is a structural property of the same weights producing both output and evaluation. External evaluation with a different model (different weights, different training) breaks this circularity.

Experiment 3 quantified the effect: upgrading the evaluator from glm4:9b to Qwen3-Coder:30b (a model from a different family, 3x larger) improved mean composite score by ~1.2 points — not because the producer changed, but because the evaluator caught failures the smaller model missed.
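One way to enforce evaluator independence at the harness level is to refuse evaluation when producer and evaluator are the same model. The sketch below is hypothetical: evaluate_externally, the Evaluator callable type, the score dictionary and stub_evaluator are illustrative names, not part of the harness described here. Comparing model names is only a rough proxy for "different weights".

```python
from typing import Callable, Dict, List

# An evaluator takes (output, criteria) and returns one score per criterion.
Evaluator = Callable[[str, List[str]], Dict[str, float]]


def evaluate_externally(
    output: str,
    criteria: List[str],
    producer_model: str,
    evaluator_model: str,
    evaluator: Evaluator,
) -> Dict[str, float]:
    """Score `output` with an evaluator that is not the producer."""
    # Same model on both sides would reintroduce the circularity.
    if producer_model == evaluator_model:
        raise ValueError("evaluator must be a different model from the producer")
    scores = evaluator(output, criteria)
    # The harness, not the evaluator, decides whether the scoring is complete.
    missing = [c for c in criteria if c not in scores]
    if missing:
        raise ValueError(f"evaluator returned no score for: {missing}")
    return scores


def stub_evaluator(output: str, criteria: List[str]) -> Dict[str, float]:
    # Stand-in for a real model call; scores non-empty output as 1.0.
    return {c: 1.0 if output.strip() else 0.0 for c in criteria}


scores = evaluate_externally(
    "Some produced text",
    ["depth", "specificity"],
    producer_model="producer-a",      # placeholder names, not real models
    evaluator_model="evaluator-b",
    evaluator=stub_evaluator,
)
```

Passing the evaluator in as a callable keeps the guard testable without a model call; a real harness would wrap its own evaluator client here.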

The Count Check in Practice

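def synthesize_with_count_check(task, context, plan, producer_model):
    """Produce output, retrying once if an enumerated task has the wrong item count."""
    # Note: producer_model is accepted but not used in this excerpt.
    output = call_producer(task, context, plan)

    # Only enumerated tasks with an explicit count constraint are checked.
    if plan.task_type == "enumerated" and plan.count_constraint:
        actual = count_top_level_items(output)
        expected = plan.count_constraint

        if actual != expected:
            log(f"[count check] expected {expected}, got {actual}, retrying")
            # Tell the producer the measured count, not its own claimed count.
            retry_prompt = (
                f"Your output has {actual} items. The task requires exactly {expected}. "
                f"Please revise to have exactly {expected} top-level sections."
            )
            # Single retry: the result is returned without a second count check.
            output = call_producer(task + "\n\n" + retry_prompt, context, plan)

    return output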

The retry is a single attempt, not a loop. If the second attempt also gets the count wrong, the output proceeds to the Wiggum evaluator, which will flag it as a completeness failure. The count check eliminates the most obvious structural errors before the expensive evaluation step runs.
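To see the single-retry behaviour in isolation, the sketch below replaces the harness's producer call with a stub that replays canned outputs. It is a simplified standalone version for illustration, not the harness function itself: synthesize_with_single_retry and make_stub_producer are hypothetical names, and the task text is made up.

```python
from typing import Callable, List, Tuple


def count_top_level_items(markdown: str) -> int:
    # Count "## " headings as items, as in the count check above.
    return sum(1 for line in markdown.splitlines() if line.startswith("## "))


def synthesize_with_single_retry(
    task: str, expected: int, produce: Callable[[str], str]
) -> Tuple[str, int]:
    """Return (output, number of producer calls); retries at most once."""
    output = produce(task)
    calls = 1
    actual = count_top_level_items(output)
    if actual != expected:
        retry = f"Your output has {actual} items. The task requires exactly {expected}."
        output = produce(task + "\n\n" + retry)
        calls += 1  # no further count check: a second miss is left to the evaluator
    return output, calls


def make_stub_producer(outputs: List[str]) -> Callable[[str], str]:
    """Return a producer that replays canned outputs in order."""
    remaining = iter(outputs)
    return lambda prompt: next(remaining)


four = "\n".join(f"## Item {i}" for i in range(1, 5))
five = "\n".join(f"## Item {i}" for i in range(1, 6))

# First attempt has 4 items, retry has 5.
fixed, fixed_calls = synthesize_with_single_retry(
    "List 5 techniques", 5, make_stub_producer([four, five])
)
# Both attempts have 4 items: the function still stops after one retry.
still_wrong, wrong_calls = synthesize_with_single_retry(
    "List 5 techniques", 5, make_stub_producer([four, four])
)
```

In the second case the producer is called twice and the four-item output is returned as is, which matches the design described above: the bounded retry keeps cost predictable and leaves persistent failures to the evaluator.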

Checking Behavioral Claims

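import os

def verify_output_written(output_path: str, min_size_bytes: int = 100) -> bool:
    """Return True only if the file exists and is at least min_size_bytes long."""
    # Check the disk rather than trusting the model's claim that it wrote the file.
    if not os.path.exists(output_path):
        log(f"[verify] file not found: {output_path}")
        return False
    # A very small file is treated as empty or a stub.
    size = os.path.getsize(output_path)
    if size < min_size_bytes:
        log(f"[verify] file too small: {size} bytes")
        return False
    return True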

File existence and minimum size are checked after every synthesis stage that is supposed to produce a file. A file smaller than 100 bytes is almost certainly empty or a stub — caught and flagged before the run logs as PASS.
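The three outcomes of this check (missing file, stub file, real output) can be exercised against a temporary directory. The sketch below is a standalone variant written for illustration: check_stage_output is a hypothetical name, and it returns a reason string instead of logging so that it runs without the harness's log helper.

```python
import os
import tempfile


def check_stage_output(path: str, min_size_bytes: int = 100) -> str:
    """Return "PASS" or a "FAIL: ..." reason for a stage's output file."""
    # os.path functions do not expand "~", so expand it explicitly.
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return f"FAIL: file not found: {path}"
    size = os.path.getsize(path)
    if size < min_size_bytes:
        return f"FAIL: file too small: {size} bytes"
    return "PASS"


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "missing.md")  # never created
    stub = os.path.join(tmp, "stub.md")        # 4 bytes
    full = os.path.join(tmp, "output.md")      # well over 100 bytes
    with open(stub, "w", encoding="utf-8") as f:
        f.write("TODO")
    with open(full, "w", encoding="utf-8") as f:
        f.write("## Item\n" + "x" * 200)
    results = [check_stage_output(p) for p in (missing, stub, full)]
```

Because os.path.exists does not expand a leading ~, a literal path such as ~/Desktop/output.md would be reported as missing even when the file is there; expanding it with os.path.expanduser first avoids that false failure.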