Harness Engineering for AI Agents · Context Engineering & Memory

The Planning Stage

10 min read
By the end of this reading you will be able to:
  • Explain how the planner classifies task type and complexity and why these classifications affect downstream pipeline behavior
  • Describe how memory context from prior runs is incorporated into the planning prompt and why this reduces redundant research
  • Trace how a Plan dataclass is produced from a task string and used by the research and synthesis stages

Why Plan Before Searching?

The naive approach to agentic research is: take the task string, turn it directly into a search query, search, synthesize. This produces mediocre results because a task string like "research cost envelope management best practices for production AI agents and save to ~/Desktop/output.md" is a poor search query. It is too long, it is specific about things a search engine cannot use (the output path), and it is vague about what actually matters (which aspects of the topic to look for).

The planning stage separates task understanding from task execution. A dedicated model pass analyzes the task, classifies it, extracts what is already known, and produces targeted search queries. The research stage then executes against those queries rather than against the raw task string.

The Plan Dataclass

from dataclasses import dataclass

@dataclass
class Plan:
    task_type: str           # "enumerated" | "best_practices" | "research"
    complexity: str          # "low" | "medium" | "high"
    search_queries: list     # 2-3 targeted search strings
    prior_work: str          # summary of what memory already covers
    expected_sections: list  # expected output structure
    subtasks: list           # for orchestrator compound tasks

The planner model (glm4:9b) produces an instance of this dataclass from two inputs: the task string and the memory context retrieved in the previous stage.
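To make the step from model output to Plan concrete, here is a minimal sketch that assumes the planner model replies with a JSON object whose keys match the Plan fields. The reply format, the helper name plan_from_json, and the fallback values ("research", "medium") are assumptions made for illustration, not details taken from the harness.

```python
import json
from dataclasses import dataclass

TASK_TYPES = {"enumerated", "best_practices", "research"}
COMPLEXITIES = {"low", "medium", "high"}


@dataclass
class Plan:
    task_type: str
    complexity: str
    search_queries: list
    prior_work: str
    expected_sections: list
    subtasks: list


def _label(value, allowed, default):
    """Return value if it is one of the allowed labels, else the default."""
    return value if isinstance(value, str) and value in allowed else default


def plan_from_json(raw: str) -> Plan:
    """Build a Plan from a JSON reply (hypothetical reply format)."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed replies
    # Keep only non-empty string queries, with surrounding whitespace removed.
    queries = [q.strip() for q in data.get("search_queries", [])
               if isinstance(q, str) and q.strip()]
    return Plan(
        # Fallback labels are an assumption of this sketch.
        task_type=_label(data.get("task_type"), TASK_TYPES, "research"),
        complexity=_label(data.get("complexity"), COMPLEXITIES, "medium"),
        search_queries=queries,
        prior_work=str(data.get("prior_work", "")),
        expected_sections=list(data.get("expected_sections", [])),
        subtasks=list(data.get("subtasks", [])),
    )
```

Validating the two label fields at parse time matters because later stages branch on them; an unrecognized label would otherwise silently skip every branch.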

Task Type Classification

The task_type field takes one of three values. Each value changes which post-synthesis checks the harness runs and which criteria the Wiggum evaluator applies:

enumerated tasks have an explicit count constraint: "find the top 5 techniques", "identify the 3 most common failure modes". For these tasks, the harness activates count-checking after synthesis — it verifies that exactly N items were produced and retries once if not. The Wiggum evaluator also applies a completeness criterion tied to the count.

best_practices tasks ask for recommendations without a count constraint: "what are the best practices for X". The synthesis instruction leans toward concrete implementation guidance. Wiggum checks for actionable steps per recommendation rather than item count.

research tasks ask for synthesis across sources: "survey the literature on X", "compare approaches A, B, and C". Wiggum checks for multi-source integration and appropriate uncertainty acknowledgment.

Classification errors propagate through the entire run. A best_practices task misclassified as enumerated will trigger the count check on output that was never intended to enumerate exactly N items, causing spurious retries.
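The count check for enumerated tasks can be sketched as follows. This is an illustration under stated assumptions: the function names, the regular expressions for finding N in the task and for counting numbered items in the output, and the injected synthesize callable are all hypothetical. The real harness may detect the count and phrase the retry differently.

```python
import re


def extract_expected_count(task: str):
    """Return N from phrases like 'top 5' or 'the 3 most common', else None."""
    match = re.search(r"\b(?:top|the)\s+(\d+)\b", task, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def count_numbered_items(output: str) -> int:
    """Count lines that begin with a list number such as '1.' or '2)'."""
    return len(re.findall(r"^\s*\d+[.)]\s+", output, flags=re.MULTILINE))


def synthesize_with_count_check(task: str, task_type: str, synthesize) -> str:
    """Run synthesis; for enumerated tasks, retry once on a count mismatch."""
    output = synthesize(task)
    if task_type != "enumerated":
        return output  # the count check only applies to enumerated tasks
    expected = extract_expected_count(task)
    if expected is not None and count_numbered_items(output) != expected:
        output = synthesize(task)  # single retry, as described above
    return output
```

The early return for non-enumerated task types is what a misclassification defeats: a best_practices task labeled enumerated falls through to the count comparison and can trigger a retry it never needed.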

Complexity Classification

The complexity field drives skill auto-activation. When complexity == "high", the /panel skill auto-activates, so the output gets a full 3-persona parallel evaluation in addition to the standard Wiggum loop. High complexity also changes how many search rounds the research loop runs before applying the saturation gate.

# From skills.py — auto-activation predicate for /panel
def auto_trigger(task: str, plan: Plan) -> bool:
    return plan.complexity == "high"
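One way a harness could consume predicates like this is to keep a registry of skills and ask each one whether it wants to activate for the current plan. The sketch below assumes such a registry; the Skill class, the SKILLS list, and active_skills are hypothetical names, and only the predicate itself mirrors the excerpt above. Plan is reduced to the single field the predicate reads.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    complexity: str  # only the field the predicate needs in this sketch


@dataclass
class Skill:
    name: str
    auto_trigger: Callable[[str, Plan], bool]


def panel_auto_trigger(task: str, plan: Plan) -> bool:
    """Same predicate as the skills.py excerpt: fire on high complexity."""
    return plan.complexity == "high"


# Hypothetical registry; the real harness may organize skills differently.
SKILLS = [Skill(name="/panel", auto_trigger=panel_auto_trigger)]


def active_skills(task: str, plan: Plan, requested=()) -> list:
    """Merge explicitly requested skills with those that auto-activate."""
    names = list(requested)
    for skill in SKILLS:
        if skill.name not in names and skill.auto_trigger(task, plan):
            names.append(skill.name)
    return names
```

Keeping the predicate a pure function of (task, plan) means activation can be checked without running the pipeline.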

Memory Context Injection

Before calling the planner, the harness retrieves relevant observations from memory:

memory_context = memory.get_context(task, top_k=3)
plan = make_plan(task, memory_context=memory_context)

The memory context is embedded in the planning prompt:

Previous work on related topics:
{memory_context}

Task: {task}

Classify the task type, assess complexity, generate 2-3 targeted search queries,
and note what the previous work already covers so we don't repeat it.

The planner's prior_work output tells the research stage which aspects of the topic are already covered. This prevents the search loop from re-fetching information already in memory and injecting it redundantly into the synthesis context.
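Assembling the planning prompt is a plain template fill. The sketch below reuses the template text shown above; the function name build_planning_prompt and the "(none)" placeholder for an empty memory are assumptions for illustration.

```python
PLANNING_PROMPT = """Previous work on related topics:
{memory_context}

Task: {task}

Classify the task type, assess complexity, generate 2-3 targeted search queries,
and note what the previous work already covers so we don't repeat it."""


def build_planning_prompt(task: str, memory_context: str = "") -> str:
    """Fill the planning template with the task and retrieved memory."""
    # An explicit placeholder (an assumption here) tells the model that
    # memory was consulted and came back empty.
    context = memory_context.strip() or "(none)"
    return PLANNING_PROMPT.format(memory_context=context, task=task.strip())
```

On a first run with no stored observations, the prompt still has the same shape, so the planner can report an empty prior_work rather than guessing.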

Search Query Generation

The planner generates 2–3 targeted search queries rather than one. For a task like "explain the top 5 context engineering techniques for production LLM agents", the planner might generate:

["context engineering LLM agents production techniques 2024",
 "retrieval augmented generation context window management",
 "chain of thought few shot prompting production deployment"]

These queries are more focused than the raw task string and are designed to fetch complementary information — the first hits survey papers and blog posts, the second hits RAG-specific sources, the third hits prompting-specific sources. The research stage runs these queries sequentially, assessing the novelty of each round before continuing.
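The sequential loop with a novelty check might look like the sketch below. It is not the harness's implementation: the name run_research, the injected search callable returning (url, snippet) pairs, the URL-based novelty measure, and the 0.3 threshold are all assumptions chosen to keep the example small.

```python
def run_research(queries, search, novelty_threshold=0.3):
    """Run queries in order; stop once a round adds too little new material.

    search(query) is assumed to return a list of (url, snippet) tuples.
    """
    seen_urls = set()
    collected = []
    for query in queries:
        results = search(query)
        new_results = []
        for url, snippet in results:
            if url not in seen_urls:
                seen_urls.add(url)
                new_results.append((url, snippet))
        collected.extend(new_results)
        # Novelty = share of this round's results not seen in earlier rounds.
        novelty = len(new_results) / len(results) if results else 0.0
        if novelty < novelty_threshold:
            break  # saturation gate: later queries are skipped
    return collected
```

Because the queries are designed to be complementary, a low-novelty round is a reasonable signal that the remaining queries would mostly return sources already collected.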

Testing the Planner in Isolation

Because the planner is a separate module, it can be tested independently:

python planner.py "Search for the top 5 context engineering techniques and save to output.md"

This prints the full Plan object without running any subsequent stages. It is useful for debugging classification errors or evaluating query quality before committing to a full run.