Harness Engineering for AI Agents · The Harness Thesis

Pipeline Architecture

15 min read
By the end of this reading you will be able to:
  • Trace every stage of the single-focus run lifecycle from parse_skills() through compress_and_store() and name the responsible file for each
  • Distinguish between single-focus tasks (agent.py) and compound tasks (orchestrator.py) and explain when to use each
  • Explain the role of each component file in the harness and the data it consumes and produces

Two Task Types

The harness handles two fundamentally different kinds of work:

Single-focus tasks are routed through agent.py. A single-focus task has one clear deliverable: research a topic, produce a document, annotate an abstract. The run lifecycle below describes single-focus tasks.

Compound tasks are routed through orchestrator.py. A compound task — "research agent failure modes and context engineering, synthesize into a unified guide" — is decomposed into subtasks, each run through the single-focus pipeline in parallel, and then assembled into a final document. The orchestrator is a coordination layer on top of the agent, not a separate system.
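
The fan-out and assembly described above can be sketched as follows. This is a minimal illustration, not the actual orchestrator.py: the names run_compound and fake_agent are hypothetical, the decomposition step is skipped (the subtasks are passed in directly), and assembly is reduced to concatenation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_compound(
    subtasks: list[str],
    run_single_focus: Callable[[str], str],
    max_workers: int = 4,
) -> str:
    """Run each subtask through a single-focus pipeline, then assemble."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so sections line up with subtasks.
        sections = list(pool.map(run_single_focus, subtasks))
    # Assumption: assembly is plain concatenation here. The real orchestrator
    # is described as producing a unified document, which likely does more.
    return "\n\n".join(sections)


def fake_agent(task: str) -> str:
    # Hypothetical stand-in for the agent.py single-focus pipeline.
    return f"## {task}\n(draft for {task})"


doc = run_compound(["agent failure modes", "context engineering"], fake_agent)
print(doc)
```

Passing the single-focus pipeline in as a callable mirrors the point made above: the orchestrator coordinates the agent rather than replacing it.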

The Single-Focus Run Lifecycle

parse_skills()                  skills.py
  → memory.get_context()        memory.py
    → make_plan()               planner.py
      → auto_activate()         skills.py
        → gather_research()     agent.py
            web_search_raw()      search_cache → DDGS
            compress_knowledge()  rolling LLM compression
            read_file_context()   chunker + MarkItDown
            enrich_with_page_content()  URL enrichment
          → synthesize()        agent.py (producer model)
            → count check + retry
              → write output
                → wiggum_loop() wiggum.py
                    → run_panel()  panel.py (post_wiggum skills)
                      → post_synthesis skills  skills.py
                        → compress_and_store()  memory.py

Each arrow is a function call. Each indentation level is a nested call or sub-stage. Let's walk through what happens at each stage.

Stage 1: parse_skills()

File: skills.py

The task string may begin with /skill tokens, for example: /annotate /deep Research RAG techniques and save to output.md. parse_skills() strips these tokens from the task string and returns both the clean task and the set of explicitly activated skills. This happens before any model call, because skills can affect every subsequent stage.
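
A minimal sketch of this token stripping, assuming skills appear only as leading whitespace-separated tokens. The function name parse_skills_sketch is hypothetical, and the real skills.py may validate tokens against a registry, which this version does not.

```python
def parse_skills_sketch(task: str) -> tuple[str, set[str]]:
    """Strip leading /skill tokens; return (clean_task, explicit_skills)."""
    words = task.split()
    skills: set[str] = set()
    i = 0
    # Only leading tokens count as skills, so a path such as /tmp/out.md
    # appearing later in the task is left untouched.
    while i < len(words) and words[i].startswith("/") and len(words[i]) > 1:
        skills.add(words[i][1:])
        i += 1
    return " ".join(words[i:]), skills


clean_task, explicit = parse_skills_sketch(
    "/annotate /deep Research RAG techniques and save to output.md"
)
print(clean_task)
print(sorted(explicit))
```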

Stage 2: memory.get_context()

File: memory.py

Before planning, the harness retrieves relevant observations from prior runs. The memory system maintains two indices: a ChromaDB vector store for semantic similarity retrieval and a SQLite FTS5 index for keyword matching. The combined context — typically 3–5 relevant past observations — is injected into the planning prompt. This lets the planner avoid re-researching topics the agent has already covered.
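
One way to combine the two indices is sketched below. The source does not say how the results are merged, so the strategy here (semantic hits first, then keyword hits, with duplicates dropped) is an assumption, as are the names get_context_sketch, semantic_search, and keyword_search. The two retrievers are passed in as plain callables, so no ChromaDB or SQLite setup is needed to follow the logic.

```python
from typing import Callable

# A retriever takes (query, limit) and returns observation strings.
Retriever = Callable[[str, int], list[str]]


def get_context_sketch(
    query: str,
    semantic_search: Retriever,
    keyword_search: Retriever,
    limit: int = 5,
) -> str:
    """Merge results from both indices into one context string."""
    merged: list[str] = []
    for obs in semantic_search(query, limit) + keyword_search(query, limit):
        if obs not in merged:  # drop observations found by both indices
            merged.append(obs)
    return "\n".join(f"- {obs}" for obs in merged[:limit])


# Hypothetical retrievers returning canned observations.
def fake_semantic(query: str, limit: int) -> list[str]:
    return ["RAG chunking notes", "Reranker comparison"][:limit]


def fake_keyword(query: str, limit: int) -> list[str]:
    return ["Reranker comparison", "BM25 baseline"][:limit]


context = get_context_sketch("RAG techniques", fake_semantic, fake_keyword)
print(context)
```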

Stage 3: make_plan()

File: planner.py

The planner model (glm4:9b by default) analyzes the task and produces a Plan dataclass:

from dataclasses import dataclass

@dataclass
class Plan:
    task_type: str                # "enumerated", "best_practices", "research"
    complexity: str               # "low", "medium", "high"
    search_queries: list[str]     # two or more targeted search strings
    prior_work: str               # what memory already covers
    expected_sections: list[str]  # expected output structure

The task_type field matters downstream: the Wiggum evaluator applies different criteria to enumerated tasks (must hit the specified count) versus best-practices tasks (must cover practical implementation) versus research tasks (must integrate multiple sources).

Stage 4: auto_activate()

File: skills.py

Some skills activate automatically based on task content or plan properties. /deep fires when the task mentions "comprehensive", "exhaustive", or "deep dive". /panel fires when plan.complexity == "high". /annotate fires when the task mentions "paper", "abstract", or "survey". Auto-activation happens after planning so that the plan's complexity assessment can trigger skills.
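
The three rules above can be sketched as simple keyword and field checks. This is an illustration only: auto_activate_sketch and PlanStub are hypothetical names, and plain substring matching is an assumption (it would also fire on "newspaper", for instance), so the real matching in skills.py may be stricter.

```python
from dataclasses import dataclass


@dataclass
class PlanStub:
    """Hypothetical stand-in carrying only the field the rules need."""
    complexity: str  # "low", "medium", "high"


DEEP_WORDS = ("comprehensive", "exhaustive", "deep dive")
ANNOTATE_WORDS = ("paper", "abstract", "survey")


def auto_activate_sketch(task: str, plan: PlanStub, explicit: set[str]) -> set[str]:
    """Return explicit skills plus any that the task or plan triggers."""
    text = task.lower()
    active = set(explicit)  # copy so the caller's set is not mutated
    if any(word in text for word in DEEP_WORDS):
        active.add("deep")
    if plan.complexity == "high":
        active.add("panel")
    if any(word in text for word in ANNOTATE_WORDS):
        active.add("annotate")
    return active


active = auto_activate_sketch(
    "Write a comprehensive survey of RAG", PlanStub(complexity="high"), set()
)
print(sorted(active))
```

Because the function takes the plan as an argument, it can only run after make_plan(), which is the ordering constraint described above.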

Stage 5: gather_research()

File: agent.py

This is the most complex stage. The research loop:

  1. Checks if a cached research context exists (RESEARCH_CACHE=1 env flag)
  2. Runs web searches (DDGS) against the planner's queries, with a 24h SQLite TTL cache
  3. Assesses the novelty of each search round and stops when a round adds little new information
  4. Compresses results into a rolling knowledge state after each accepted round
  5. Reads any file paths detected in the task string (using the chunker for large files)
  6. Enriches the top novel URLs with full page content via MarkItDown

The output is a single string of formatted research context passed to synthesis.
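
The novelty-based stopping rule in step 3 can be sketched as follows. The source does not specify how novelty is measured, so this version assumes the simplest proxy: the fraction of URLs in a round that have not been seen before. The names novelty and research_rounds and the 0.3 threshold are hypothetical.

```python
from typing import Callable


def novelty(known_urls: set[str], round_urls: list[str]) -> float:
    """Fraction of this round's URLs that were not seen in earlier rounds."""
    if not round_urls:
        return 0.0
    fresh = [url for url in round_urls if url not in known_urls]
    return len(fresh) / len(round_urls)


def research_rounds(
    queries: list[str],
    search: Callable[[str], list[str]],
    min_novelty: float = 0.3,
) -> tuple[list[str], set[str]]:
    """Run one search round per query; stop at the first low-novelty round."""
    known: set[str] = set()
    accepted: list[str] = []
    for query in queries:
        urls = search(query)
        if novelty(known, urls) < min_novelty:
            break  # this round adds little that is new, so stop searching
        accepted.append(query)
        known.update(urls)
    return accepted, known


# Hypothetical canned search results keyed by query.
RESULTS = {"q1": ["a", "b"], "q2": ["b", "c"], "q3": ["a", "c"], "q4": ["d"]}
accepted, known = research_rounds(["q1", "q2", "q3", "q4"], lambda q: RESULTS[q])
print(accepted, sorted(known))
```

In this run the third round returns only URLs already seen, so the loop stops there and the fourth query is never issued.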

Stage 6: synthesize()

File: agent.py (calls the producer model)

The synthesis call assembles:

  • The task description
  • Retrieved memory context (from stage 2)
  • The research context (from stage 5)
  • Skill-injected prompts (pre_synthesis hooks)
  • The synthesis instruction (the target of autoresearch optimization in Module 5)

The producer model returns Markdown. For enumerated tasks, the harness checks whether the correct count was produced and retries once if not.
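
A minimal sketch of the count check and single retry, assuming the enumerated items appear as a top-level numbered Markdown list. The names count_numbered_items and synthesize_with_count_check, and the wording of the retry prompt, are hypothetical.

```python
import re
from typing import Callable


def count_numbered_items(markdown: str) -> int:
    """Count lines that start with '1. ', '2. ', and so on."""
    return len(re.findall(r"^\d+\.\s", markdown, flags=re.MULTILINE))


def synthesize_with_count_check(
    prompt: str, expected: int, produce: Callable[[str], str]
) -> str:
    """Call the producer; retry exactly once if the item count is wrong."""
    draft = produce(prompt)
    if count_numbered_items(draft) != expected:
        retry_prompt = f"{prompt}\n\nProduce exactly {expected} numbered items."
        draft = produce(retry_prompt)  # single retry, accepted as-is
    return draft


# Hypothetical producer: returns 2 items on the first call, 3 afterwards.
calls: list[str] = []


def fake_producer(prompt: str) -> str:
    calls.append(prompt)
    n = 2 if len(calls) == 1 else 3
    return "\n".join(f"{i}. item" for i in range(1, n + 1))


result = synthesize_with_count_check("List 3 techniques", 3, fake_producer)
print(result)
```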

Stage 7: wiggum_loop()

File: wiggum.py

The output enters the evaluate → revise → verify loop. The evaluator model scores the output across five dimensions (covered in detail in Module 3). If the score is below threshold, the evaluator provides structured feedback, the producer revises, and the loop repeats for up to 3 rounds. A final PASS/FAIL determination is then recorded.
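
The control flow of this loop can be sketched as below. This is a simplification under stated assumptions: the evaluator returns one integer score out of 10 rather than five dimensions, the threshold of 8 is invented for the example, and a "round" is taken to mean one evaluate step followed by at most one revision. The name wiggum_loop_sketch is hypothetical.

```python
from typing import Callable


def wiggum_loop_sketch(
    draft: str,
    evaluate: Callable[[str], tuple[int, str]],  # returns (score, feedback)
    revise: Callable[[str, str], str],           # takes (draft, feedback)
    threshold: int = 8,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Evaluate and revise up to max_rounds times; return (draft, passed)."""
    for _ in range(max_rounds):
        score, feedback = evaluate(draft)
        if score >= threshold:
            return draft, True
        draft = revise(draft, feedback)
    # Verify the last revision so the final PASS/FAIL reflects the final text.
    score, _ = evaluate(draft)
    return draft, score >= threshold


# Hypothetical evaluator: each "+" appended by a revision adds 2 points.
def fake_evaluate(draft: str) -> tuple[int, str]:
    return 5 + 2 * draft.count("+"), "add more detail"


def fake_revise(draft: str, feedback: str) -> str:
    return draft + "+"


final_draft, passed = wiggum_loop_sketch("draft", fake_evaluate, fake_revise)
print(final_draft, passed)
```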

Stage 8: compress_and_store()

File: memory.py

After a successful run, the planner model compresses the run into a structured observation — a title, a narrative paragraph, and a list of key facts — and stores it in both the SQLite and ChromaDB indices. Future runs on related topics will retrieve this observation in stage 2.
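
The shape of a stored observation can be sketched as follows. This covers only the SQLite side and uses an ordinary table as a stand-in for the FTS5 index; the ChromaDB write and the model-driven compression are omitted. The Observation class, the table layout, and store_observation are hypothetical, chosen to match the three parts named above.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Observation:
    title: str
    narrative: str
    key_facts: list[str] = field(default_factory=list)


def store_observation(conn: sqlite3.Connection, obs: Observation) -> int:
    """Insert one observation and return its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations "
        "(title TEXT, narrative TEXT, key_facts TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO observations VALUES (?, ?, ?)",
        (obs.title, obs.narrative, json.dumps(obs.key_facts)),  # list as JSON
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
row_id = store_observation(
    conn,
    Observation(
        title="RAG techniques overview",
        narrative="Surveyed chunking and reranking approaches.",
        key_facts=["Chunk size affects recall", "Rerankers add latency"],
    ),
)
row = conn.execute("SELECT title, key_facts FROM observations").fetchone()
print(row_id, row[0], json.loads(row[1]))
```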

Component Roles

| File | Role | Input | Output |
|---|---|---|---|
| agent.py | Main entry point; stage orchestration | Task string | Written file + run record |
| planner.py | Task analysis | Task + memory context | Plan dataclass |
| memory.py | Persistent observation store | Query / Observation | Context string / stored row |
| wiggum.py | Evaluate → revise → verify | Draft output + task | Scored, revised output |
| skills.py | Skill registry and injection | Task / pipeline stage | Modified prompts / behavior |
| chunker.py | Large-doc context extraction | File path | Chunked context string |
| logger.py | Structured run logging | Stage events | runs.jsonl + trace JSON |
| security.py | Code and injection scanning | Code / search results | Sanitized inputs |
| orchestrator.py | Compound task coordination | Compound task | Assembled document |
| inference.py | Backend shim | Model call | Ollama/vLLM response |

The Shared Log

Every run appends a record to runs.jsonl, and every stage span is appended to a Chrome Trace Event JSON file in traces/. These two outputs are the ground truth for all analysis: experiments compare runs.jsonl entries, and Perfetto visualizes the traces/ files. Anything that is not logged is invisible to analysis.
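
A minimal sketch of both log writes. The record fields (task, verdict) and the function names are hypothetical, since the source does not list the runs.jsonl schema. The span uses the Chrome Trace Event "complete" event form (ph set to "X", with ts and dur in microseconds), which Perfetto can load.

```python
import json
import tempfile
from pathlib import Path


def append_run_record(path: Path, record: dict) -> None:
    """Append one JSON object per line (the JSON Lines convention)."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def span_event(name: str, start_s: float, end_s: float) -> dict:
    """Build a Chrome Trace Event 'complete' event for one stage span."""
    return {
        "name": name,
        "ph": "X",                              # complete event: start + duration
        "ts": round(start_s * 1_000_000),       # start time in microseconds
        "dur": round((end_s - start_s) * 1_000_000),
        "pid": 1,
        "tid": 1,
    }


with tempfile.TemporaryDirectory() as tmp:
    runs_path = Path(tmp) / "runs.jsonl"
    # Hypothetical record fields, for illustration only.
    append_run_record(runs_path, {"task": "Research RAG", "verdict": "PASS"})
    append_run_record(runs_path, {"task": "Annotate abstract", "verdict": "FAIL"})
    records = [json.loads(line) for line in runs_path.read_text().splitlines()]

event = span_event("make_plan", 1.0, 1.5)
print(len(records), event)
```

Appending one line per run keeps the file easy to compare across experiments, because each run can be read back independently of the others.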