Why External Verification
- Identify three categories of model self-report that cannot be trusted and explain the harness-side check that replaces each
- Explain why model-generated evaluation of model-generated output is unreliable and how external evaluation breaks the circularity
- Implement a harness-side count check that verifies enumerated output meets its count constraint without relying on the model's self-report
The Trust Problem
The second harness engineering design principle is: verify externally at every stage boundary. This principle exists because language models are confidently wrong in ways that are invisible without external checking.
Three specific failure patterns make this concrete:
File existence. The model says: "I have saved the output to ~/Desktop/output.md." The harness checks with os.path.exists(), expanding ~ first with os.path.expanduser() because os.path.exists() does not expand it. In early harness runs, the model wrote output to a path it hallucinated rather than the path specified in the task. The model reported success, but no file existed at the specified path. Without the check, the run was logged as PASS on a non-existent output.
Token counts. The model says: "The output is approximately 500 tokens." The harness instead reads the counts from the response's usage metadata: total_tokens for cost accounting (in OpenAI-style APIs it covers prompt plus completion), or the completion-side count when checking output length alone. Models systematically underestimate their own token consumption by 30–50%, not from deception but from imprecise self-estimation. Token accounting based on model self-report therefore produces systematically wrong cost calculations.
Item counts. The model says: "Here are the top 5 context engineering techniques" and then produces 4, or 6, or 3. It is not lying — it lost count. The harness counts the actual items in the output using Python:
    def count_top_level_items(markdown: str) -> int:
        # Count ## headings as items in enumerated output
        return len([line for line in markdown.split('\n')
                    if line.startswith('## ')])


    def count_check(output: str, expected_count: int) -> bool:
        actual = count_top_level_items(output)
        return actual == expected_count
If the count is wrong, the harness requests a revision — before the output ever reaches the Wiggum evaluator.
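For the token-count pattern, a minimal sketch of harness-side accounting might look like the following. It assumes an OpenAI-style response whose `usage` mapping carries `prompt_tokens` and `completion_tokens`; other APIs name these fields differently, so the keys are assumptions to adapt. `TokenUsage`, `usage_from_response`, and `relative_error` are hypothetical names introduced for illustration, and the example payload is made up.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def usage_from_response(response: Mapping[str, Any]) -> TokenUsage:
    """Read token counts from API metadata, never from the model's prose.

    Assumes an OpenAI-style ``usage`` mapping; adapt the keys for other APIs.
    """
    usage = response["usage"]
    return TokenUsage(
        prompt_tokens=int(usage["prompt_tokens"]),
        completion_tokens=int(usage["completion_tokens"]),
    )


def relative_error(claimed: int, measured: int) -> float:
    """Signed error of a self-reported count; negative means an underestimate."""
    if measured <= 0:
        raise ValueError("measured token count must be positive")
    return (claimed - measured) / measured


# Hypothetical response, shaped like the assumed API payload.
example_response = {
    "usage": {"prompt_tokens": 1200, "completion_tokens": 800, "total_tokens": 2000}
}
usage = usage_from_response(example_response)
error = relative_error(claimed=500, measured=usage.completion_tokens)
```

Logging the signed error alongside each run makes the size of the self-report gap visible, while cost calculations use only the measured values.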
Categories of Untrusted Self-Report
The failure patterns above illustrate the first two of three general categories of untrusted self-report:
| Category | Model claim | External check |
|---|---|---|
| Behavioral | "I wrote the file" | os.path.exists(path) |
| Quantitative | "The output has N items / K tokens" | Python count / response.usage |
| Qualitative | "This output is high quality" | Independent evaluator model |
The third category, qualitative self-assessment, cannot be reduced to a simple Python check. This is where the Wiggum loop comes in: a different model evaluates the output against explicit criteria.
Why Qualitative Self-Evaluation Fails
A model evaluating its own output reproduces the same systematic tendencies it used to produce it:
- If the model tends to overuse hedging language, it will score hedging language positively
- If the model was trained on data that rewards confident-sounding claims, it will score confidence positively even when specificity is lacking
- If the model produces placeholder implementations ("here you would call your API"), it will score those placeholders as meeting the depth criterion because it does not perceive the gap between a placeholder and actual implementation
This is not a failure of intelligence — it is a structural property of the same weights producing both output and evaluation. External evaluation with a different model (different weights, different training) breaks this circularity.
Experiment 3 quantified the effect: upgrading the evaluator from glm4:9b to Qwen3-Coder:30b (a model from a different family, 3x larger) improved mean composite score by ~1.2 points — not because the producer changed, but because the evaluator caught failures the smaller model missed.
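The separation between producer and evaluator can be enforced in code rather than left to convention. The sketch below is an illustration under assumptions: `evaluate_externally`, the `ModelCall` callable type, the prompt wording, and the model names are all hypothetical, and the real Wiggum evaluator may be structured quite differently. It shows only the guard that the evaluator is never the producer.

```python
from typing import Callable, Sequence

# Hypothetical signature: (model_name, prompt) -> raw model text.
ModelCall = Callable[[str, str], str]


def evaluate_externally(
    output: str,
    criteria: Sequence[str],
    producer_model: str,
    evaluator_model: str,
    call_model: ModelCall,
) -> str:
    """Ask a different model to judge the output against explicit criteria."""
    if evaluator_model == producer_model:
        raise ValueError("evaluator must differ from the producer model")
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        "Evaluate the output below against each criterion.\n"
        f"Criteria:\n{criteria_block}\n\n"
        f"Output:\n{output}"
    )
    return call_model(evaluator_model, prompt)


# Stub standing in for a real model client, for demonstration only.
def fake_call(model: str, prompt: str) -> str:
    return f"[{model}] reviewed {len(prompt)} characters"


verdict = evaluate_externally(
    output="## Item 1\nPlaceholder text",
    criteria=["specific", "complete"],
    producer_model="producer-model",
    evaluator_model="evaluator-model",
    call_model=fake_call,
)
```

A name comparison is only a coarse guard: two different tags can point at the same weights, so the choice of evaluator still needs human review.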
The Count Check in Practice
    def synthesize_with_count_check(task, context, plan, producer_model):
        # call_producer and log are harness helpers defined elsewhere.
        # Note: producer_model is accepted but not forwarded in this excerpt.
        output = call_producer(task, context, plan)
        # Only enumerated tasks with a declared count constraint are checked.
        if plan.task_type == "enumerated" and plan.count_constraint:
            actual = count_top_level_items(output)
            expected = plan.count_constraint
            if actual != expected:
                log(f"[count check] expected {expected}, got {actual}; retrying")
                retry_prompt = (
                    f"Your output has {actual} items. The task requires exactly {expected}. "
                    f"Please revise to have exactly {expected} top-level sections."
                )
                # Single corrective retry; the result is not re-checked here.
                output = call_producer(task + "\n\n" + retry_prompt, context, plan)
        return output
The retry is a single attempt, not a loop. If the second attempt also gets the count wrong, the output proceeds to Wiggum, which will flag it as a completeness failure. The count check eliminates the most obvious structural errors before expensive evaluation.
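The single-retry behaviour can be exercised without a live model. The following sketch re-implements the flow in a self-contained form: `retry_once_on_count_mismatch`, `make_stub_producer`, and the `producer` callable are hypothetical stand-ins for the harness's own functions, and the stub producer simply replays canned outputs.

```python
from typing import Callable, Iterator, List, Tuple


def count_top_level_items(markdown: str) -> int:
    """Count lines that start a level-2 markdown heading."""
    return sum(1 for line in markdown.splitlines() if line.startswith("## "))


def retry_once_on_count_mismatch(
    task: str, expected: int, producer: Callable[[str], str]
) -> Tuple[str, int]:
    """Return (output, producer_calls); at most one corrective retry is made."""
    output = producer(task)
    calls = 1
    actual = count_top_level_items(output)
    if actual != expected:
        retry_prompt = (
            f"Your output has {actual} items. "
            f"The task requires exactly {expected}."
        )
        output = producer(task + "\n\n" + retry_prompt)
        calls += 1
    return output, calls


def make_stub_producer(outputs: List[str]) -> Callable[[str], str]:
    """Build a fake producer that replays canned outputs in order."""
    replay: Iterator[str] = iter(outputs)
    return lambda prompt: next(replay)


four_items = "\n".join(f"## Item {i}" for i in range(1, 5))
five_items = "\n".join(f"## Item {i}" for i in range(1, 6))

# First attempt is short by one item; the retry fixes it.
fixed, calls_fixed = retry_once_on_count_mismatch(
    "List 5 techniques", 5, make_stub_producer([four_items, five_items])
)

# Both attempts are wrong; the function still stops after one retry.
still_wrong, calls_wrong = retry_once_on_count_mismatch(
    "List 5 techniques", 5, make_stub_producer([four_items, four_items])
)
```

In the second case the producer is called exactly twice and the four-item output is returned unchanged, which is the point at which the downstream evaluator takes over.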
Checking Behavioral Claims
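    import os


    def verify_output_written(output_path: str, min_size_bytes: int = 100) -> bool:
        """Return True only if the file exists and is at least min_size_bytes."""
        # Expand "~" first; os.path.exists() does not do it.
        output_path = os.path.expanduser(output_path)
        if not os.path.exists(output_path):
            log(f"[verify] file not found: {output_path}")  # log: harness helper
            return False
        size = os.path.getsize(output_path)
        if size < min_size_bytes:
            log(f"[verify] file too small: {size} bytes")
            return False
        return True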
File existence and minimum size are checked after every synthesis stage that is supposed to produce a file. A file smaller than 100 bytes is almost certainly empty or a stub, so it is caught and flagged before the run logs as PASS.
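One way to tie that check to run status is sketched below. The `stage_status` function and the "PASS"/"FAIL" strings are assumptions for illustration, not the harness's actual interface. The sketch repeats the existence and size checks so that it runs on its own, and it uses `os.path.isfile` rather than `os.path.exists` so that a directory at the target path also fails.

```python
import os
import tempfile


def stage_status(output_path: str, min_size_bytes: int = 100) -> str:
    """Return "PASS" only if a regular file exists and meets the minimum size."""
    if not os.path.isfile(output_path):
        return "FAIL"
    if os.path.getsize(output_path) < min_size_bytes:
        return "FAIL"
    return "PASS"


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "never_written.md")
    stub = os.path.join(tmp, "stub.md")
    real = os.path.join(tmp, "output.md")

    # A stub file well under the 100-byte threshold.
    with open(stub, "w", encoding="utf-8") as f:
        f.write("TODO")
    # A file comfortably above the threshold.
    with open(real, "w", encoding="utf-8") as f:
        f.write("## Item\n" + "x" * 200)

    statuses = {
        "missing": stage_status(missing),
        "stub": stage_status(stub),
        "real": stage_status(real),
    }
```

Deriving the status from the filesystem, rather than from the model's report of success, is what prevents a PASS on a missing or empty output.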