Harness Engineering for AI Agents · Verification & Failure Modes

Failure Taxonomy

By the end of this reading you will be able to:
  • Classify a described agent failure into the taxonomy (task drift, placeholder generation, count violation, encoding corruption, role ambiguity) and identify the harness-side countermeasure for each
  • Explain how the failure pattern analysis pipeline generates the taxonomy from the Wiggum issue strings in runs.jsonl
  • Prescribe a harness-side intervention for a given failure pattern, explaining why the intervention operates at the harness level rather than relying on better prompting

Building the Taxonomy from Data

The failure taxonomy in this reading is not theoretical — it is derived empirically from 645 Wiggum issue strings collected across 1,500 logged runs. The failure_patterns.py script clusters these strings by keyword/bigram Jaccard similarity and generates the wiki/failure-patterns.md document automatically.

python failure_patterns.py
# Generates: wiki/failure-patterns.md
# 645 issues → 107 clusters
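
The source of failure_patterns.py is not reproduced in this reading, so the following is only a minimal sketch of what clustering by keyword/bigram Jaccard similarity could look like. The function names, the greedy first-match strategy, and the 0.5 threshold are all assumptions made for illustration, not the script's actual implementation.

```python
import re

def features(issue: str) -> set[str]:
    """Lowercase word unigrams plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", issue.lower())
    bigrams = {f"{a} {b}" for a, b in zip(words, words[1:])}
    return set(words) | bigrams

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_issues(issues: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: each issue joins the first cluster whose seed is similar enough."""
    clusters: list[tuple[set[str], list[str]]] = []
    for issue in issues:
        feats = features(issue)
        for seed_feats, members in clusters:
            if jaccard(feats, seed_feats) >= threshold:
                members.append(issue)
                break
        else:
            # No existing cluster matched, so this issue seeds a new one
            clusters.append((feats, [issue]))
    return [members for _, members in clusters]
```

With this sketch, "no implementation example" and "no implementation example provided" share 5 of 7 features (Jaccard of about 0.71) and land in the same cluster, while "encoding artifacts present" starts its own.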

The five major failure classes below each represent a cluster of 10+ occurrences with an average Wiggum score below 7.5 at the time of issue detection.


Failure Class 1: Task Completion Drift

What happens: The agent researches the topic but does not produce the specified output artifact. It writes to a different path, produces a summary in the terminal rather than a file, or — in the most common variant — executes all research stages and then stops without calling the write step.

Diagnostic signal: final is ERROR or FAIL, together with output_bytes = 0 or a file-not-found error.

Root cause: The synthesis prompt does not sufficiently constrain the output action. The model interprets "save to ~/Desktop/output.md" as advisory rather than mandatory.

Harness-side countermeasure:

# After synthesis, before wiggum:
if not verify_output_written(output_path):
    # Force a targeted retry with explicit write instruction
    retry_prompt = (
        f"The output was not saved. Write the following content "
        f"to {output_path} now:\n\n{output}"
    )
    call_producer(retry_prompt)
    if not verify_output_written(output_path):
        log_and_fail("output not written after retry")
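
The countermeasure above calls verify_output_written without showing it. One plausible shape for that helper, assuming it only needs to confirm that a non-empty regular file exists at the target path, is sketched below. The min_bytes parameter is a hypothetical addition, not part of the harness described here.

```python
from pathlib import Path

def verify_output_written(output_path: str, min_bytes: int = 1) -> bool:
    """Return True if output_path is a regular file of at least min_bytes bytes."""
    path = Path(output_path).expanduser()  # resolves a leading "~"
    try:
        return path.is_file() and path.stat().st_size >= min_bytes
    except OSError:
        # Treat permission errors and races with deletion as "not written"
        return False
```

Checking the size as well as existence matters because a zero-byte file matches the output_bytes = 0 signal just as a missing file does.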

Failure Class 2: Placeholder Content Generation

What happens: The output contains real headings and structure but the content under each heading is generic or explicitly deferred: "here you would implement your RAG pipeline", "the specific implementation will depend on your use case", "[insert code example here]".

Diagnostic signal: Low depth score (5.0–7.0) + issue strings containing "lacks a concrete implementation note", "no implementation example", "too generic".

Root cause: The model resolves uncertainty by deferring to the reader rather than providing specific implementations. This is a training artifact: models learn that hedged language is safe.

Harness-side countermeasure:

# In synthesis prompt (via SYNTH_INSTRUCTION):
"""For each technique or concept:
- Provide a concrete implementation example with actual code
- Name specific tools, versions, or libraries (not generic placeholders)
- Include error handling in any code examples
- Do not defer to the reader — if the implementation requires choices, make them"""

This is also the primary target of autoresearch optimization — the synthesis instruction is tuned to push the model toward specificity and away from deferred implementations.
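
The deferral phrases quoted in this section are regular enough that a harness could also flag them mechanically before scoring. The sketch below is an illustration only: the pattern list is hypothetical, seeded from the three examples above, and find_placeholders is not a function of the harness described here.

```python
import re

# Hypothetical patterns, seeded from the deferral phrases quoted above
PLACEHOLDER_PATTERNS = [
    r"\[insert [^\]]*\]",
    r"\bhere you would\b",
    r"\bwill depend on your use case\b",
]
_PLACEHOLDER_RE = re.compile("|".join(PLACEHOLDER_PATTERNS), re.IGNORECASE)

def find_placeholders(text: str) -> list[str]:
    """Return every matched deferral phrase, in document order."""
    return [m.group(0) for m in _PLACEHOLDER_RE.finditer(text)]
```

A non-empty result could trigger a targeted retry in the same way the write check does for Failure Class 1. A fixed phrase list will miss novel wording, so it complements the synthesis instruction rather than replacing it.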


Failure Class 3: Count Constraint Violations

What happens: The task requests exactly N items; the output contains N-1, N+1, or occasionally N-3. Most commonly the model miscounts its own sections (produces 4 when 5 were requested) or adds an extra introductory section that inflates the count without adding a required item.

Diagnostic signal: Low completeness score (5.0–7.0) + issue string "section count is N, expected M".

Root cause: The model does not track section count during generation — it writes, and then the count is whatever it is.

Harness-side countermeasure:

def count_check_with_retry(output, expected_count, task, context, producer_model):
    actual = count_h2_sections(output)
    if actual == expected_count:
        return output

    log(f"[count check] expected {expected_count}, got {actual}")
    retry = call_producer(
        task + f"\n\nIMPORTANT: Your output must have exactly {expected_count} "
               f"sections (## headings). You produced {actual}. Revise now.",
        context, producer_model
    )
    return retry

The count check runs once before Wiggum. If the retry also fails, Wiggum will flag the completeness dimension — the producer gets a second chance during the revision round.
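
The count check depends on count_h2_sections, which is not shown above. A minimal sketch follows, assuming that a section means a line starting with "## " and that such lines inside backtick-fenced code blocks should be ignored. Tilde fences and indented headings are not handled here.

```python
def count_h2_sections(markdown: str) -> int:
    """Count level-2 headings ("## Title"), skipping backtick-fenced code blocks."""
    count = 0
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
        elif not in_fence and line.startswith("## "):
            count += 1
    return count
```

Skipping fenced blocks avoids counting shell comments such as "## setup" inside code examples as sections, which would otherwise trigger a spurious retry.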


Failure Class 4: Encoding Corruption

What happens: The output contains garbled Unicode characters, broken LaTeX sequences, or malformed Markdown. Common manifestations: curly quotes mis-decoded into mojibake such as â€œ, broken code blocks with unclosed backticks, or LaTeX \( sequences that appear literally because the Markdown renderer doesn't know they're math.

Diagnostic signal: Low structure score (5.0–7.0) + issue strings such as "code example is incomplete (missing closing backticks)" or "encoding artifacts present".

Root cause: Multiple encoding conversion steps in the pipeline (DDGS → UTF-8 → model tokenizer → output → file writer) each have opportunities to introduce encoding errors. Models also sometimes generate syntactically incomplete code blocks when they lose track of nesting depth.

Harness-side countermeasure:

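def sanitize_output(text: str) -> str:
    # Ensure code blocks are closed: an odd number of fences means one is still open
    if text.count('```') % 2:
        text += '\n```'

    # Fix common mojibake (UTF-8 curly quotes mis-decoded as Windows-1252).
    # Replace the three-character sequences before the bare 'â€' prefix;
    # otherwise the prefix replacement would corrupt them.
    for bad, good in (('â€œ', '"'), ('â€™', "'"), ('â€˜', "'"), ('â€', '"')):
        text = text.replace(bad, good)

    return text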

The sanitizer runs immediately after output is received, before the count check or Wiggum.


Failure Class 5: Role Ambiguity in Multi-Agent Systems

What happens: In compound task orchestration, subtask agents produce output that overlaps substantially with each other, or one subtask agent addresses material that belongs to a different subtask. The assembled document has redundant sections and gaps.

Diagnostic signal: Low completeness score on the assembled output + Wiggum issue "Section X covers material already addressed in Section Y".

Root cause: The orchestrator's subtask decomposition does not provide sufficient scope boundaries. Subtask prompts that say "research context engineering" leave the agent uncertain whether it should cover RAG, prompt chaining, or both — and it covers everything.

Harness-side countermeasure:

# Subtask prompt template in orchestrator.py
SUBTASK_PROMPT = """\
You are working on SUBTASK {i} of {total} in a compound research task.

Your specific scope: {scope_description}

Explicitly OUT OF SCOPE for your subtask (covered by other agents):
{out_of_scope}

Do not cover out-of-scope material. Focus exclusively on {scope_description}."""

Explicit out-of-scope boundaries reduce role ambiguity more effectively than positive scope descriptions alone — telling the agent what not to cover is more constraining than telling it what to cover.
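
One way to fill the out-of-scope slot without hand-writing it is to derive it from the other subtasks' scopes. The sketch below repeats the template so that it runs on its own. build_subtask_prompts is a hypothetical helper, and it assumes the orchestrator already holds one scope description per subtask.

```python
SUBTASK_PROMPT = """\
You are working on SUBTASK {i} of {total} in a compound research task.

Your specific scope: {scope_description}

Explicitly OUT OF SCOPE for your subtask (covered by other agents):
{out_of_scope}

Do not cover out-of-scope material. Focus exclusively on {scope_description}."""

def build_subtask_prompts(scopes: list[str]) -> list[str]:
    """One prompt per scope; every other scope becomes an out-of-scope bullet."""
    prompts = []
    for idx, scope in enumerate(scopes):
        others = [s for j, s in enumerate(scopes) if j != idx]
        prompts.append(SUBTASK_PROMPT.format(
            i=idx + 1,
            total=len(scopes),
            scope_description=scope,
            out_of_scope="\n".join(f"- {s}" for s in others) or "- (none)",
        ))
    return prompts
```

Calling build_subtask_prompts(["RAG retrieval", "prompt chaining"]) yields two prompts, the first listing "- prompt chaining" as out of scope and the second listing "- RAG retrieval".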


Pattern Frequency Summary

| Failure Class | Occurrences | Avg Wiggum Score | Primary Dimension Affected |
|---|---|---|---|
| Implementation notes missing | 56 | 7.1 | Depth |
| Lacks concrete example | 45 | 7.1 | Depth + Specificity |
| Context compression missing | 45 | 7.2 | Depth |
| Incomplete code block | 14 | 6.9 | Structure |
| Concrete workflow missing | 11 | 7.0 | Depth |

Depth failures dominate — 80%+ of the top failure clusters relate to insufficient implementation detail. This is what motivated the autoresearch focus on synthesis instructions that require "production-ready integration examples with full agent loop usage, error handling, and real-world scenarios."
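
The share of depth failures can be checked directly from the five rows of the table above. The snippet below is only that arithmetic. Counting "Depth + Specificity" as a depth cluster is an assumption about how the 80% figure was reached.

```python
# (cluster label, occurrences, dimensions affected), copied from the table above
clusters = [
    ("Implementation notes missing", 56, {"Depth"}),
    ("Lacks concrete example", 45, {"Depth", "Specificity"}),
    ("Context compression missing", 45, {"Depth"}),
    ("Incomplete code block", 14, {"Structure"}),
    ("Concrete workflow missing", 11, {"Depth"}),
]

depth = [c for c in clusters if "Depth" in c[2]]
cluster_share = len(depth) / len(clusters)
occurrence_share = sum(c[1] for c in depth) / sum(c[1] for c in clusters)

print(f"{cluster_share:.0%} of clusters, {occurrence_share:.1%} of occurrences")
# prints "80% of clusters, 91.8% of occurrences"
```

Four of the five clusters touch depth, and weighted by occurrence (157 of 171) the share is higher still.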