Harness Engineering for AI Agents · The Harness Thesis

Pipeline Architecture

15 min read

By the end of this reading you will be able to:

Trace every stage of the single-focus run lifecycle from parse_skills() through compress_and_store() and name the responsible file for each
Distinguish between single-focus tasks (agent.py) and compound tasks (orchestrator.py) and explain when to use each
Explain the role of each component file in the harness and the data it consumes and produces

Two Task Types

The harness handles two fundamentally different kinds of work:

Single-focus tasks are routed through agent.py. A single-focus task has one clear deliverable: research a topic, produce a document, annotate an abstract. The run lifecycle below describes single-focus tasks.

Compound tasks are routed through orchestrator.py. A compound task, such as "research agent failure modes and context engineering, synthesize into a unified guide", is decomposed into subtasks. The subtasks run through the single-focus pipeline in parallel, and their outputs are assembled into a final document. The orchestrator is a coordination layer on top of the agent, not a separate system.
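The coordination pattern can be sketched roughly as follows. Every name here (`decompose`, `run_single_focus`, `run_compound`) is hypothetical and stands in for whatever orchestrator.py and agent.py actually expose. The decomposition is a naive split on " and ", used only for illustration, not the harness's real method.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(task: str) -> list[str]:
    """Hypothetical decomposition: one subtask per ' and '-separated topic."""
    return [part.strip() for part in task.split(" and ") if part.strip()]


def run_single_focus(subtask: str) -> str:
    """Stand-in for the single-focus pipeline; returns a Markdown section."""
    return f"## {subtask}\n\n(draft for: {subtask})"


def run_compound(task: str, max_workers: int = 4) -> str:
    subtasks = decompose(task)
    # pool.map preserves input order, so sections are assembled in subtask
    # order even though the subtasks run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sections = list(pool.map(run_single_focus, subtasks))
    return "\n\n".join(sections)
```

Threads suit this sketch because a real subtask would spend most of its time waiting on model and network calls. Whether the actual orchestrator uses threads, processes, or async I/O is not stated in this reading.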

The Single-Focus Run Lifecycle

parse_skills()                  skills.py
  → memory.get_context()        memory.py
    → make_plan()               planner.py
      → auto_activate()         skills.py
        → gather_research()     agent.py
            web_search_raw()      search_cache → DDGS
            compress_knowledge()  rolling LLM compression
            read_file_context()   chunker + MarkItDown
            enrich_with_page_content()  URL enrichment
          → synthesize()        agent.py (producer model)
            → count check + retry
              → write output
                → wiggum_loop() wiggum.py
                    → run_panel()  panel.py (post_wiggum skills)
                      → post_synthesis skills  skills.py
                        → compress_and_store()  memory.py

Each arrow is a function call, and each indentation level is a nested call or sub-stage. The sections below walk through the eight main stages.

Stage 1: parse_skills()

File: skills.py

The task string may begin with /skill tokens: /annotate /deep Research RAG techniques and save to output.md. parse_skills() strips these tokens from the task string and returns both the clean task and the set of explicitly activated skills. This happens before any model call, because the activated skills can affect every subsequent stage.
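A minimal sketch of this parsing step, assuming skills are whitespace-separated tokens at the start of the string and are stored without the leading slash. The signature and return shape are assumptions; the real parse_skills() in skills.py may differ.

```python
def parse_skills(task: str) -> tuple[str, set[str]]:
    """Strip leading /skill tokens; return (clean_task, activated_skills)."""
    words = task.split()
    skills: set[str] = set()
    i = 0
    # Only leading tokens count; a "/" later in the task is ordinary text.
    while i < len(words) and words[i].startswith("/") and len(words[i]) > 1:
        skills.add(words[i][1:])
        i += 1
    # Note: split()/join() collapses runs of whitespace in the clean task.
    return " ".join(words[i:]), skills
```

Called on the example above, this sketch returns the clean task "Research RAG techniques and save to output.md" together with the skill set {"annotate", "deep"}.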

Stage 2: memory.get_context()

File: memory.py

Before planning, the harness retrieves relevant observations from prior runs. The memory system maintains two indices: a ChromaDB vector store for semantic similarity retrieval and a SQLite FTS5 index for keyword matching. The combined context — typically 3–5 relevant past observations — is injected into the planning prompt. This lets the planner avoid re-researching topics the agent has already covered.
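This reading does not say how results from the two indices are combined. One plausible approach, shown here purely as an assumption, is to interleave the two ranked lists, drop duplicates, and cap the total. `merge_context` and its parameters are hypothetical names.

```python
def merge_context(
    semantic_hits: list[str], keyword_hits: list[str], limit: int = 5
) -> str:
    """Interleave two ranked hit lists, drop duplicates, keep at most `limit`."""
    merged: list[str] = []
    seen: set[str] = set()
    for i in range(max(len(semantic_hits), len(keyword_hits))):
        # Take the i-th hit from each list in turn so neither index dominates.
        for hits in (semantic_hits, keyword_hits):
            if i < len(hits) and hits[i] not in seen:
                seen.add(hits[i])
                merged.append(hits[i])
    return "\n".join(f"- {obs}" for obs in merged[:limit])
```

The default limit of 5 mirrors the "typically 3–5" figure above. The bullet formatting of the returned string is illustrative only.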

Stage 3: make_plan()

File: planner.py

The planner model (glm4:9b by default) analyzes the task and produces a Plan dataclass:

from dataclasses import dataclass

@dataclass
class Plan:
    task_type: str        # "enumerated", "best_practices", "research"
    complexity: str       # "low", "medium", "high"
    search_queries: list  # 2+ targeted search strings
    prior_work: str       # what memory already covers
    expected_sections: list  # expected output structure

The task_type field matters downstream: the Wiggum evaluator applies different criteria to enumerated tasks (must hit the specified count) versus best-practices tasks (must cover practical implementation) versus research tasks (must integrate multiple sources).

Stage 4: auto_activate()

File: skills.py

Some skills activate automatically based on task content or plan properties. /deep fires when the task mentions "comprehensive", "exhaustive", or "deep dive". /panel fires when plan.complexity == "high". /annotate fires when the task mentions "paper", "abstract", or "survey". Auto-activation happens after planning so that the plan's complexity assessment can trigger skills.
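The three rules above can be sketched as a single function. The signature is an assumption, as are the case-insensitive substring matching and the choice to store skills without the leading slash. Substring matching is a simplification: "paper" would also match "newspaper".

```python
def auto_activate(task: str, complexity: str, explicit: set[str]) -> set[str]:
    """Return explicit skills plus any triggered by the task text or plan."""
    text = task.lower()
    skills = set(explicit)  # copy so the caller's set is not mutated
    if any(kw in text for kw in ("comprehensive", "exhaustive", "deep dive")):
        skills.add("deep")
    if complexity == "high":
        skills.add("panel")
    if any(kw in text for kw in ("paper", "abstract", "survey")):
        skills.add("annotate")
    return skills
```

Because `complexity` comes from the Plan, this function can only run after make_plan(), which matches the ordering described above.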

Stage 5: gather_research()

File: agent.py

This is the most complex stage. The research loop:

Checks if a cached research context exists (RESEARCH_CACHE=1 env flag)
Runs web searches (DDGS) against the planner's queries, with a 24h SQLite TTL cache
Assesses the novelty of each search round and stops searching once a round adds little new information
Compresses results into a rolling knowledge state after each accepted round
Reads any file paths detected in the task string (using the chunker for large files)
Enriches the top novel URLs with full page content via MarkItDown

The output is a single string of formatted research context passed to synthesis.
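The novelty-based stopping rule in the loop above might look like the following sketch. The novelty measure (the fraction of results in a round not seen before) and the 0.3 threshold are assumptions chosen for illustration. The reading does not specify how the harness scores novelty.

```python
def novelty(results: list[str], seen: set[str]) -> float:
    """Fraction of results in this round not seen in earlier rounds."""
    if not results:
        return 0.0
    return sum(1 for r in results if r not in seen) / len(results)


def gather_rounds(rounds: list[list[str]], min_novelty: float = 0.3) -> list[str]:
    """Accept rounds in order; stop at the first round below `min_novelty`."""
    seen: set[str] = set()
    accepted: list[str] = []
    for results in rounds:
        if novelty(results, seen) < min_novelty:
            break  # this round adds too little; stop searching
        for r in results:
            if r not in seen:
                seen.add(r)
                accepted.append(r)
    return accepted
```

In the real loop each accepted round would also be compressed into the rolling knowledge state. That step is omitted here to keep the stopping logic visible.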

Stage 6: synthesize()

File: agent.py (calls the producer model)

The synthesis call assembles:

The task description
Retrieved memory context (from stage 2)
The research context (from stage 5)
Skill-injected prompts (pre_synthesis hooks)
The synthesis instruction (the target of autoresearch optimization in Module 5)

The producer model returns Markdown. For enumerated tasks, the harness checks whether the correct count was produced and retries once if not.
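A sketch of the count check and single retry, assuming enumerated output takes the form of a top-level numbered Markdown list. `count_items`, `synthesize_with_count_check`, and the retry wording are hypothetical. The harness may count items differently.

```python
import re
from typing import Callable


def count_items(markdown: str) -> int:
    """Count top-level numbered list items such as '1. ...' in Markdown."""
    return len(re.findall(r"^\d+\.\s", markdown, flags=re.MULTILINE))


def synthesize_with_count_check(
    generate: Callable[[str], str], prompt: str, expected: int
) -> str:
    """Call `generate`; if the item count is wrong, retry exactly once."""
    draft = generate(prompt)
    if count_items(draft) != expected:
        retry_prompt = f"{prompt}\n\nProduce exactly {expected} numbered items."
        draft = generate(retry_prompt)
    return draft
```

The retry result is returned even if its count is still wrong, which leaves any remaining problem for the evaluation stage to catch.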

Stage 7: wiggum_loop()

File: wiggum.py

The output enters the evaluate → revise → verify loop. The evaluator model scores the output across five dimensions (covered in detail in Module 3). If the score is below threshold, the evaluator provides structured feedback, the producer revises, and the loop repeats. The loop runs for up to 3 rounds, and a final PASS/FAIL determination is recorded.
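The control flow can be sketched as below. The threshold value of 8.0, the single numeric score, and the way rounds are counted are all assumptions. The real evaluator scores five dimensions, and its threshold is not given in this reading.

```python
from typing import Callable


def wiggum_loop(
    draft: str,
    evaluate: Callable[[str], tuple[float, str]],
    revise: Callable[[str, str], str],
    threshold: float = 8.0,
    max_rounds: int = 3,
) -> tuple[str, str, int]:
    """Return (final_draft, "PASS" or "FAIL", evaluation rounds used)."""
    for round_no in range(1, max_rounds + 1):
        score, feedback = evaluate(draft)
        if score >= threshold:
            return draft, "PASS", round_no
        # Revise only if another evaluation round remains to verify the result.
        if round_no < max_rounds:
            draft = revise(draft, feedback)
    return draft, "FAIL", max_rounds
```

Passing `evaluate` and `revise` in as callables keeps the loop independent of which models play the evaluator and producer roles.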

Stage 8: compress_and_store()

File: memory.py

After a successful run, the planner model compresses the run into a structured observation — a title, a narrative paragraph, and a list of key facts — and stores it in both the SQLite and ChromaDB indices. Future runs on related topics will retrieve this observation in stage 2.
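The shape of a stored observation might be sketched like this. The `Observation` dataclass, the table schema, and `store_observation` are hypothetical. The sketch writes to a plain SQLite table only and omits both the FTS5 index and the ChromaDB store that the harness uses.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Observation:
    title: str
    narrative: str
    key_facts: list[str] = field(default_factory=list)


def store_observation(conn: sqlite3.Connection, obs: Observation) -> int:
    """Insert one observation and return its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations "
        "(id INTEGER PRIMARY KEY, title TEXT, narrative TEXT, key_facts TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO observations (title, narrative, key_facts) VALUES (?, ?, ?)",
        (obs.title, obs.narrative, json.dumps(obs.key_facts)),  # facts as JSON text
    )
    conn.commit()
    return cur.lastrowid
```

Passing `sqlite3.connect(":memory:")` as the connection is a convenient way to try the sketch without touching disk.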

Component Roles

File	Role	Input	Output
`agent.py`	Main entry point; stage orchestration	Task string	Written file + run record
`planner.py`	Task analysis	Task + memory context	`Plan` dataclass
`memory.py`	Persistent observation store	Query / Observation	Context string / stored row
`wiggum.py`	Evaluate → revise → verify	Draft output + task	Scored, revised output
`skills.py`	Skill registry and injection	Task / pipeline stage	Modified prompts / behavior
`chunker.py`	Large-doc context extraction	File path	Chunked context string
`logger.py`	Structured run logging	Stage events	`runs.jsonl` + trace JSON
`security.py`	Code and injection scanning	Code / search results	Sanitized inputs
`orchestrator.py`	Compound task coordination	Compound task	Assembled document
`inference.py`	Backend shim	Model call	Ollama/vLLM response

The Shared Log

Every run appends a record to runs.jsonl. Every stage span appends an event to a Chrome Trace Event JSON file in traces/. These two outputs are the ground truth for all analysis: experiments compare runs.jsonl entries, and Perfetto visualizes the traces/ files. Anything that is not logged is invisible to later analysis.
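A rough sketch of the two kinds of log entry. The helper names and the record fields are hypothetical. The trace event uses the Chrome Trace Event "complete" phase ("X"), whose `ts` and `dur` fields are expressed in microseconds. What logger.py actually records per stage is not described here.

```python
import json
from pathlib import Path


def append_run(log_path: Path, record: dict) -> None:
    """Append one run record as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def span_event(name: str, start_s: float, end_s: float) -> dict:
    """Build a Chrome Trace 'complete' event from start/end times in seconds."""
    return {
        "name": name,
        "ph": "X",  # complete event: a start timestamp plus a duration
        "ts": int(start_s * 1_000_000),
        "dur": int((end_s - start_s) * 1_000_000),
        "pid": 1,
        "tid": 1,
    }
```

Appending one JSON object per line keeps runs.jsonl safe to extend without rewriting earlier records. Events like the one built here would be collected into a JSON array, or an object with a `traceEvents` array, for a trace viewer to load.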

References

ollama-pi-harness: source code for all components described in this reading
