Harness Engineering for AI Agents · Verification & Failure Modes

Pre-evaluation Summarization

10 min read

By the end of this reading you will be able to:

Explain why a summarization step is inserted before each evaluator and revision call, and what failure modes it prevents
Distinguish the two summarization modes — section-preserving for evaluation and surgical for revision — and explain why each mode is shaped the way it is
Describe the pre-truncation step and explain the 85%/15% head-tail split strategy

The Problem

The Wiggum loop, as described in the previous reading, passes the output directly to the evaluator and, if revision is needed, directly to the producer. This works for short documents. In practice, production runs often generate 8,000–15,000-character outputs, and the evaluator is typically a smaller, faster model (an 8B model with an 8K context window) chosen precisely because it needs independent judgment rather than production-grade capability.

Two things go wrong when a long document reaches that evaluator without preprocessing:

Context overflow. A 12,000-character output consumes roughly 3,000 tokens (about four characters per token). Once the evaluation prompt and task description are added, much of the evaluator's usable context is consumed, and the model begins to lose track of earlier sections.
Signal dilution. Even within context limits, a document that runs to 15,000 characters asks the evaluator to hold and compare many more claims than it can reliably reason about. The result is superficial scoring: structure and relevance scores are typically inflated and depth failures are underweighted, because the evaluator can see the headings but cannot audit every implementation claim.
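The context arithmetic behind the first failure mode can be sketched in a few lines. This is a rough estimate only: the four-characters-per-token ratio is a rule of thumb for English prose, the 8,192-token window is one reading of "8K", and `overhead_tokens` is a hypothetical stand-in for the evaluation prompt plus task description.

```python
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a tokenizer measurement


def estimate_tokens(doc_chars: int) -> int:
    """Approximate a token count from a character count (ceiling division)."""
    return -(-doc_chars // CHARS_PER_TOKEN)


def remaining_context(doc_chars: int, overhead_tokens: int,
                      window_tokens: int = 8192) -> int:
    """Tokens left for the evaluator's reasoning and output.

    overhead_tokens is a hypothetical figure covering the evaluation
    prompt and task description; window_tokens assumes an 8K window.
    """
    return window_tokens - overhead_tokens - estimate_tokens(doc_chars)


# A 12,000-character output is roughly 3,000 tokens under this estimate.
doc_tokens = estimate_tokens(12_000)
```

A negative or small return value from `remaining_context` is the signal that the document should be compressed before the evaluator sees it.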

A summarization step placed in front of each evaluation or revision call addresses both problems. The harness has a dedicated summarizer.py module that runs as a preprocessing stage inside the Wiggum loop, not as an external pipeline component.

Pre-truncation

Before the summarizer model is called at all, the document is pre-truncated if it exceeds a character cap:

_SUMMARIZER_INPUT_CAP = 20_000  # characters

def _pretruncate(text: str) -> str:
    if len(text) <= _SUMMARIZER_INPUT_CAP:
        return text
    head = int(_SUMMARIZER_INPUT_CAP * 0.85)
    tail = _SUMMARIZER_INPUT_CAP - head
    return text[:head] + "\n\n[... truncated ...]\n\n" + text[-tail:]

This step prevents the summarizer itself, which is typically the same small model used for compression elsewhere in the pipeline, from running out of memory (OOM) on very large inputs. The 85%/15% split keeps the head and tail of the document, where the introduction and conclusion sit; these are the highest-signal parts of a research document. With the 20,000-character cap, a 30,000-character document keeps its first 17,000 characters and its last 3,000, so the opening argument structure and the final recommendations survive. The 10,000 characters dropped in between fall in the body, which is where supporting detail (already handled by the research context) lives.

Pre-truncation applies to the summarizer's input. The original full document is not modified.

Mode 1: Summarizing for the Evaluator

When called before an evaluation, the summarizer uses a section-preserving strategy:

SUMMARIZE_FOR_EVAL_PROMPT = """\
Summarize the following document for a quality evaluator.

Rules:
- Keep ALL second-level headings (## lines) exactly as written
- Under each heading, write one sentence capturing the key claim
- Append the last 400 characters of the original document verbatim
- Output must be under 4,000 characters total

Document:
{document}"""

The output the evaluator receives is a structural skeleton: every ## heading the model produced, a one-sentence claim per section, and the raw conclusion. This gives the evaluator what it needs to assess completeness (are all required sections present?), relevance (do the section topics address the task?), and structure (is the document organized correctly?). The verbatim tail ensures the conclusion — often where synthesis quality is highest or lowest — is seen in full.

The trade-off is that depth and specificity scores become somewhat weaker signals on long documents, because the evaluator is assessing one-sentence summaries of implementation sections rather than the implementations themselves. The rubric weight configuration (depth at 0.25 rather than a higher value) partially accounts for this: depth failures are still penalized but not catastrophically over-counted relative to what the evaluator actually saw.
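Because the summarizer is itself a small model, it may not always follow the prompt's rules. One way to guard against that is a cheap post-check on its output before handing it to the evaluator. The function below is a hypothetical helper, not part of summarizer.py as described; it checks the three mechanical rules (headings kept, tail appended, length limit) and leaves the one-sentence-per-section rule unchecked.

```python
from __future__ import annotations

import re


def check_eval_summary(original: str, summary: str,
                       tail_chars: int = 400,
                       max_chars: int = 4000) -> list[str]:
    """Return a list of rule violations; an empty list means the summary passes."""
    problems: list[str] = []

    # Rule: every second-level heading must appear exactly as written.
    headings = re.findall(r"^## .*$", original, flags=re.MULTILINE)
    summary_lines = set(summary.splitlines())
    for heading in headings:
        if heading not in summary_lines:
            problems.append(f"missing heading: {heading}")

    # Rule: the last tail_chars characters of the original are appended verbatim.
    # Trailing whitespace is ignored on both sides.
    tail = original[-tail_chars:].rstrip()
    if not summary.rstrip().endswith(tail):
        problems.append("verbatim tail missing")

    # Rule: output must be under max_chars characters in total.
    if len(summary) >= max_chars:
        problems.append(f"summary too long: {len(summary)} characters")

    return problems
```

A harness could retry the summarizer, or fall back to the pre-truncated document, whenever the returned list is non-empty; which fallback is appropriate is a design choice the reading does not specify.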

Mode 2: Summarizing for the Revision Prompt

When called before sending the document to the producer for revision, the summarizer uses a surgical strategy:

SUMMARIZE_FOR_REVISION_PROMPT = """\
Prepare the following document for targeted revision.

The evaluator identified these specific issues:
{issue_list}

Rules:
- For sections that contain or relate to any issue above: keep VERBATIM
- For all other sections: condense to 2-3 sentences
- Do not add headings or commentary

Document:
{document}"""

The logic here is different from evaluation mode. The producer needs to fix specific problems — "Section 3 lacks a concrete implementation note for the Redis TTL configuration" — and to do that it needs to see the exact text it wrote, not a summary of it. A summary of a placeholder section is just a shorter placeholder; the producer cannot write a concrete fix without seeing the original gap.

Sections not mentioned in the issue list are condensed, which keeps the revision prompt within context limits while ensuring the full problematic sections are preserved verbatim.
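Filling the template is straightforward, but the shape of `{issue_list}` matters because the summarizer has to match issues to sections. The sketch below assumes the evaluator's issues arrive as a list of strings and renders them as one bullet per issue; both the function name and the bullet format are assumptions for illustration.

```python
from __future__ import annotations

SUMMARIZE_FOR_REVISION_PROMPT = """\
Prepare the following document for targeted revision.

The evaluator identified these specific issues:
{issue_list}

Rules:
- For sections that contain or relate to any issue above: keep VERBATIM
- For all other sections: condense to 2-3 sentences
- Do not add headings or commentary

Document:
{document}"""


def build_revision_summary_prompt(document: str, issues: list[str]) -> str:
    """Fill the surgical-mode template with the document and a bulleted issue list.

    Hypothetical helper: blank issues are skipped and each remaining
    issue is rendered on its own "- " line.
    """
    issue_list = "\n".join(f"- {issue.strip()}" for issue in issues if issue.strip())
    return SUMMARIZE_FOR_REVISION_PROMPT.format(issue_list=issue_list,
                                                document=document)


prompt = build_revision_summary_prompt(
    "## Caching\nUse Redis with a TTL of {ttl} seconds.",
    ["Section 3 lacks a concrete implementation note for the Redis TTL configuration", ""],
)
```

Curly braces inside the document are safe here: `str.format` substitutes the document as a value and does not re-parse it for placeholders.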

Why Two Modes

The evaluator and the producer have opposite needs from the same long document:

	Evaluator	Producer (revision)
Needs	Coverage overview — all sections visible	Exact problem text — verbatim issue sections
Tolerates	One-sentence section summaries	Condensed non-issue sections
Cannot use	Verbatim everything (context overflow)	Structural skeleton (can't fix what it can't see)

Using a single summarization mode for both would force a compromise that serves neither well. Section-preserving mode fed to the producer gives it headings but not implementation text; surgical mode fed to the evaluator gives it the problem sections but may hide completeness failures in condensed non-problem sections.

Configuration

Two environment variables control when summarization activates:

SUMMARIZER_EVAL_THRESHOLD=6000    # characters; summarize before eval if output exceeds this
SUMMARIZER_REVISE_THRESHOLD=5000  # characters; summarize before revision if output exceeds this

The revision threshold is lower than the eval threshold because the revision prompt also includes the task description, the issue list, and the production instruction — so the combined context grows faster than for evaluation. Short documents (typical for 7B models or constrained tasks) bypass the summarizer entirely, which avoids adding latency and a model call to runs that don't need it.
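The bypass logic can be sketched as a small gate that reads the two variables and compares them against the output length. The function names are hypothetical, and falling back to 6000 and 5000 when a variable is unset is an assumption; the reading gives those values as settings, not necessarily as built-in defaults.

```python
import os


def _threshold(name: str, default: int) -> int:
    """Read an integer threshold from the environment, with a fallback."""
    return int(os.environ.get(name, default))


def needs_summary(text: str, mode: str) -> bool:
    """Return True if text is long enough to be summarized for the given mode.

    mode is "eval" or "revise" (labels chosen for this sketch).
    """
    if mode == "eval":
        limit = _threshold("SUMMARIZER_EVAL_THRESHOLD", 6000)
    elif mode == "revise":
        limit = _threshold("SUMMARIZER_REVISE_THRESHOLD", 5000)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # "Exceeds" is read as strictly greater than the threshold.
    return len(text) > limit
```

With the values shown above, a 5,500-character output would be summarized before revision but passed to the evaluator unchanged.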
