Harness Engineering for AI Agents · Self-Improvement

The Autoresearch Loop

12 min read

By the end of this reading you will be able to:

Explain the autoresearch keep rule and the delta threshold that determines whether a synthesis instruction change is committed or reverted
Describe the three-phase autoresearch loop (propose → test → evaluate) and the role of the research cache in making it computationally feasible
Analyze the Session 1 autoresearch findings to identify which synthesis instruction changes improved scores, which degraded them, and why the proposer clustered

The Idea

The synthesis instruction — the directive appended to every synthesis prompt — is the single highest-leverage parameter in the harness. It tells the model how to structure its output, what depth to aim for, and what traps to avoid. Manual tuning of this instruction is effective but slow: each candidate instruction requires running the full eval suite to measure its effect.

Autoresearch automates this loop. It is named for the idea, attributed to Andrej Karpathy, of a system that improves its own performance by evaluating and editing its own code or prompts.

The Three Phases

propose → test → evaluate/keep

Propose: An LLM generates a candidate change to the synthesis instruction, grounded in an analysis of recent failure patterns from runs.jsonl:

PROPOSE_PROMPT = """\
The current synthesis instruction is:
{current_instruction}

Recent failure patterns from evaluation:
{failure_summary}

Research context (from prompt engineering literature):
{research_context}

Propose ONE specific change to the synthesis instruction that would address
the most common failure pattern. The change should:
- Target depth or specificity (the weakest dimensions)
- Be a concrete addition or modification, not a vague suggestion
- Not make the instruction more restrictive in ways that reduce quality

Output ONLY the revised instruction, nothing else."""
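The {failure_summary} slot has to be filled from the logged runs. The following is a minimal sketch of one way to do that. It assumes each line of runs.jsonl is a JSON object with a per-dimension "scores" mapping. That schema and the helper name summarize_failures are assumptions for illustration, not the harness's actual format.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize_failures(jsonl_lines, worst_n=2):
    """Return a short text summary of the lowest-scoring dimensions.

    Assumes (hypothetically) that every record looks like
    {"task": "...", "scores": {"depth": 7.0, "specificity": 6.5, ...}}.
    """
    by_dimension = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        for dimension, score in record.get("scores", {}).items():
            by_dimension[dimension].append(float(score))

    # Sort dimensions from weakest to strongest by mean score.
    ranked = sorted(by_dimension.items(), key=lambda item: mean(item[1]))
    return "\n".join(
        f"- {dimension}: mean {mean(scores):.2f} over {len(scores)} runs"
        for dimension, scores in ranked[:worst_n]
    )


sample_log = [
    '{"task": "t1", "scores": {"depth": 6.0, "specificity": 7.0, "accuracy": 9.0}}',
    '{"task": "t2", "scores": {"depth": 7.0, "specificity": 8.0, "accuracy": 9.5}}',
]
failure_summary = summarize_failures(sample_log)
print(failure_summary)
```

The resulting string can then be passed to PROPOSE_PROMPT.format(...) together with current_instruction and research_context.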

Test: Run the eval suite (5 tasks, 3 replications each) with the proposed instruction. The research cache makes this feasible — search results are cached, so only the synthesis and Wiggum stages run. A full eval session takes ~30–45 minutes with the cache active.
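As a rough sketch of the test phase, the composite score for a candidate can be computed by scoring every task several times and averaging. The function run_eval_suite and the score_once callable are hypothetical stand-ins, and averaging task means is an assumption about how the composite is aggregated.

```python
from statistics import mean


def run_eval_suite(tasks, replications, score_once):
    """Score every task `replications` times and average the results.

    `score_once(task, replication)` is a hypothetical callable that runs
    synthesis plus evaluation for one task and returns a composite score.
    """
    per_task = {}
    for task in tasks:
        scores = [score_once(task, rep) for rep in range(replications)]
        per_task[task] = mean(scores)
    # With equal replications per task, the mean of task means equals
    # the mean over all individual runs.
    return mean(list(per_task.values())), per_task


def fake_scorer(task, replication):
    # Deterministic stand-in so the sketch runs without a model.
    return 8.0 + 0.1 * replication


tasks = ["task_a", "task_b", "task_c", "task_d", "task_e"]
composite, per_task = run_eval_suite(tasks, replications=3, score_once=fake_scorer)
print(f"{composite:.3f}")
```

Replicating each task matters because a single run of a single task is too noisy to compare against a 0.1 threshold.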

Evaluate/Keep: Compute the composite score delta.

KEEP_THRESHOLD = 0.1  # composite score improvement required to keep

def evaluate_and_keep(baseline_score, candidate_score, instruction_change):
    delta = candidate_score - baseline_score

    if delta > KEEP_THRESHOLD:
        log(f"KEEP: delta={delta:+.3f} — committing change")
        # git add agent.py && git commit -m "autoresearch: {change_summary}"
        commit_change(instruction_change)
        return True  # baseline updates to new score
    else:
        log(f"DISCARD: delta={delta:+.3f} — reverting")
        # git reset --soft HEAD~1 only undoes a commit (if one was made);
        # git checkout HEAD -- agent.py restores the previous instruction text
        revert_change()
        return False

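Git is used as the undo mechanism. Each candidate instruction is applied to agent.py as a diff, and a discarded change is rolled back without losing the work log. Note that git reset --soft HEAD~1 only moves HEAD back one commit and leaves the edited agent.py in place, so it applies only if the candidate was already committed; restoring the previous instruction text itself takes git checkout HEAD -- agent.py.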
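Putting the three phases together, the outer loop can be sketched as below. The propose and evaluate callables are hypothetical stand-ins for the LLM proposer and the eval suite, and the scripted scores are invented purely so the example runs. The keep rule is the same delta > KEEP_THRESHOLD test used in evaluate_and_keep.

```python
KEEP_THRESHOLD = 0.1  # same threshold as above


def autoresearch_loop(instruction, baseline, propose, evaluate, max_experiments):
    """Run propose -> test -> evaluate/keep, updating the baseline on each keep.

    `propose(instruction, history)` and `evaluate(instruction)` are
    hypothetical callables standing in for the LLM proposer and the eval suite.
    """
    history = []
    for number in range(1, max_experiments + 1):
        candidate = propose(instruction, history)
        score = evaluate(candidate)
        delta = score - baseline
        kept = delta > KEEP_THRESHOLD
        history.append({"exp": number, "score": score, "delta": delta, "kept": kept})
        if kept:
            # The candidate becomes the instruction later proposals build on.
            instruction, baseline = candidate, score
    return instruction, baseline, history


# Deterministic stand-ins so the sketch runs without a model.
scripted_scores = iter([8.25, 8.85, 8.90])
final_instruction, final_baseline, history = autoresearch_loop(
    instruction="v0",
    baseline=8.285,
    propose=lambda current, past: f"{current}+edit{len(past) + 1}",
    evaluate=lambda candidate: next(scripted_scores),
    max_experiments=3,
)
print(final_instruction, final_baseline, [h["kept"] for h in history])
```

In this run the third candidate scores higher than the second but is still discarded, because it improves on the updated baseline by only about 0.05. Raising the baseline after every keep makes each later change harder to accept.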

The Research Cache

Without the research cache, each eval session requires 5 × 3 = 15 separate search sessions, roughly 30–60 minutes of network and compression time before any synthesis happens. The cache is enabled with an environment variable:

export RESEARCH_CACHE=1
python autoresearch.py

The cache stores the complete research context (post-compression, post-enrichment) for each task. Synthesis starts immediately. A 15-run eval session takes 25–35 minutes instead of 90–120 minutes.

The cache TTL is 24 hours, so autoresearch sessions that span multiple days re-fetch research once the previous day's entries have expired, typically on the first run of each day.
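A minimal sketch of a per-task cache with a 24-hour TTL follows. The file layout, the cached_research helper, and the fetch callable are assumptions made for illustration; only the RESEARCH_CACHE variable and the TTL value come from the text above.

```python
import hashlib
import json
import os
import tempfile
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as described above


def cached_research(task, fetch, cache_dir, now=None):
    """Return research context for `task`, re-fetching only when stale.

    `fetch(task)` is a hypothetical callable that runs the full search,
    compression, and enrichment pipeline and returns a string.
    """
    now = time.time() if now is None else now
    key = hashlib.sha256(task.encode("utf-8")).hexdigest()[:16]
    path = Path(cache_dir) / f"{key}.json"

    if os.environ.get("RESEARCH_CACHE") == "1" and path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now - entry["stored_at"] < CACHE_TTL_SECONDS:
            return entry["context"]  # cache hit: skip the search stage

    context = fetch(task)
    path.write_text(json.dumps({"stored_at": now, "context": context}), encoding="utf-8")
    return context


os.environ["RESEARCH_CACHE"] = "1"
calls = []


def fake_fetch(task):
    calls.append(task)
    return f"research for {task}"


with tempfile.TemporaryDirectory() as tmp:
    cached_research("task_a", fake_fetch, tmp, now=0)          # miss: fetches
    cached_research("task_a", fake_fetch, tmp, now=3600)       # hit within TTL
    cached_research("task_a", fake_fetch, tmp, now=25 * 3600)  # expired: re-fetches
print(len(calls))
```

Storing the context after compression and enrichment, rather than raw search results, is what lets synthesis start immediately on a hit.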

Session 1 Findings

Session 1 ran 13 experiments across two days, testing 13 candidate instruction changes against a baseline score of 8.285.

Best-performing change (exp 3, score 8.845, delta +0.560):

"Added requirement for production-ready integration examples with full agent loop usage, error handling, and real-world scenarios to address depth and specificity weaknesses."

This is the only change from Session 1 that survived (delta > 0.1) and became the new baseline.

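What didn't work (each score is judged against the baseline in effect when the experiment ran, which rises after a kept change):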

Experiment	Score	Change	Why it failed
exp 2	8.250	Explicit tool versions + integration steps	Over-constrains format
exp 4	8.180	Measurable outcome per section	Too mechanical
exp 5	7.935	Expected cost improvement per technique	Off-topic for synthesis tasks
exp 7	7.615	Complete executable code + error handling	Forces code where prose is better
exp 8	8.530	What/Why/How/Outcome sub-structure	Helpful but too rigid

High variance (±1.2 across sessions) is the key diagnostic finding: the instruction framing matters more than any single element. Small changes in how requirements are phrased produce large swings in quality.
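The keep rule can be replayed against the scores reported above to make the deltas explicit. This sketch assumes the experiments ran in numbered order and that the baseline changed only after a keep; experiments not listed in the text are omitted.

```python
KEEP_THRESHOLD = 0.1

# (experiment, composite score) pairs copied from the text above.
reported = [
    ("exp 2", 8.250),
    ("exp 3", 8.845),
    ("exp 4", 8.180),
    ("exp 5", 7.935),
    ("exp 7", 7.615),
    ("exp 8", 8.530),
]


def replay(scores, baseline):
    """Apply the keep rule in order, raising the baseline after each keep."""
    rows = []
    for name, score in scores:
        delta = round(score - baseline, 3)
        kept = delta > KEEP_THRESHOLD
        rows.append((name, baseline, delta, kept))
        if kept:
            baseline = score
    return rows


rows = replay(reported, baseline=8.285)
for name, baseline, delta, kept in rows:
    verdict = "KEEP" if kept else "DISCARD"
    print(f"{name}: baseline={baseline:.3f} delta={delta:+.3f} {verdict}")
```

Under those assumptions, exp 8's 8.530 is above the original 8.285 baseline but 0.315 below the 8.845 baseline set by exp 3, which is why a "helpful" change still ends up discarded.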

The Proposer Clustering Problem

After the initial success, the proposer clustered heavily in the "add code examples" space for 10 consecutive experiments. Each proposal was a variation on the same theme, and none cleared the keep threshold.

This resembles an optimizer stuck in a local region of the search space. In this case the pull comes from the proposer's training data, which skews toward code-centric solutions for technical writing tasks.

Escape strategies:

Research grounding: In Session 2, a gather_proposal_context() step was added — the proposer first researches prompt engineering literature, then proposes changes grounded in what it reads. This breaks the proposer out of its training distribution.
Negative constraints: Adding explicit instructions to NOT generate code-centric changes for non-enumerable tasks.
Dimension weighting: Steering proposals toward dimensions that autoresearch hasn't yet explored: output structure, comparison/trade-off framing, uncertainty acknowledgment.

def gather_proposal_context(model, planner_model):
    """Research prompt engineering literature before proposing."""
    queries = [
        "LLM synthesis instruction techniques depth specificity",
        "prompt engineering output quality improvement",
        "chain of thought few shot prompting production"
    ]
    # Run with /deep to force comprehensive search
    return gather_research(
        "Research prompt engineering improvements for AI synthesis",
        planned_queries=queries,
        max_rounds=3
    )
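The negative-constraint and dimension-weighting strategies can be combined into one block of extra guidance appended to the proposer prompt. The sketch below is an assumption about how that might look: build_steering_block, the dimension labels, and the tally of past proposals are all hypothetical.

```python
ALL_DIMENSIONS = [
    "code examples",
    "output structure",
    "comparison/trade-off framing",
    "uncertainty acknowledgment",
]


def build_steering_block(explored_counts, banned_themes, max_targets=2):
    """Build extra proposer guidance from what past experiments already tried.

    `explored_counts` maps a dimension name to how many past proposals
    targeted it; the labels themselves are hypothetical.
    """
    # Prefer the dimensions with the fewest past attempts
    # (sorted() is stable, so ties keep their order in ALL_DIMENSIONS).
    targets = sorted(ALL_DIMENSIONS, key=lambda d: explored_counts.get(d, 0))[:max_targets]
    lines = ["Additional constraints for this proposal:"]
    lines += [f"- Do NOT propose changes that {theme}." for theme in banned_themes]
    lines += [f"- Prefer changes that target: {dimension}." for dimension in targets]
    return "\n".join(lines)


steering = build_steering_block(
    explored_counts={"code examples": 10, "output structure": 1},
    banned_themes=["require code examples for non-enumerable tasks"],
)
print(steering)
```

With ten past proposals counted against "code examples", the block steers the next proposal toward the two dimensions that have not been tried at all.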
