The Autoresearch Loop
- Explain the autoresearch keep rule and the delta threshold that determines whether a synthesis instruction change is committed or reverted
- Describe the three-phase autoresearch loop (propose → test → evaluate) and the role of the research cache in making it computationally feasible
- Analyze the Session 1 autoresearch findings to identify which synthesis instruction changes improved scores, which degraded them, and why the proposer clustered
The Idea
The synthesis instruction — the directive appended to every synthesis prompt — is the single highest-leverage parameter in the harness. It tells the model how to structure its output, what depth to aim for, and what traps to avoid. Manual tuning of this instruction is effective but slow: each candidate instruction requires running the full eval suite to measure its effect.
Autoresearch automates this loop. It is named for the idea, attributed to Andrej Karpathy, of a system that improves its own performance by evaluating and editing its own code or prompts.
The Three Phases
propose → test → evaluate/keep
Propose: An LLM generates a candidate change to the synthesis instruction, grounded in an analysis of recent failure patterns from runs.jsonl:
PROPOSE_PROMPT = """\
The current synthesis instruction is:
{current_instruction}
Recent failure patterns from evaluation:
{failure_summary}
Research context (from prompt engineering literature):
{research_context}
Propose ONE specific change to the synthesis instruction that would address
the most common failure pattern. The change should:
- Target depth or specificity (the weakest dimensions)
- Be a concrete addition or modification, not a vague suggestion
- Not make the instruction more restrictive in ways that reduce quality
Output ONLY the revised instruction, nothing else."""
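A minimal sketch of how the three placeholders might be filled. The runs.jsonl schema shown here (one JSON object per line with a `scores` mapping from dimension to score) and the helper names `summarize_failures` and `build_propose_prompt` are assumptions for illustration, not the harness's actual format.

```python
import json
from collections import defaultdict


def summarize_failures(jsonl_lines, worst_n=2):
    """Average each scoring dimension and report the weakest ones.

    Assumes each line looks like
    {"task": "...", "scores": {"depth": 6.5, "specificity": 7.0}}.
    This schema is hypothetical.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        for dim, score in record.get("scores", {}).items():
            totals[dim] += score
            counts[dim] += 1
    means = {dim: totals[dim] / counts[dim] for dim in totals}
    # Lowest mean first; keep only the worst_n dimensions.
    weakest = sorted(means.items(), key=lambda kv: kv[1])[:worst_n]
    return "\n".join(f"- {dim}: mean score {mean:.2f}" for dim, mean in weakest)


def build_propose_prompt(template, current_instruction, failure_summary, research_context):
    """Fill the placeholders that PROPOSE_PROMPT expects."""
    return template.format(
        current_instruction=current_instruction,
        failure_summary=failure_summary,
        research_context=research_context,
    )
```

Passing the template in as an argument keeps the sketch independent of where PROPOSE_PROMPT is defined.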
Test: Run the eval suite (5 tasks, 3 replications each) with the proposed instruction. The research cache makes this feasible — search results are cached, so only the synthesis and Wiggum stages run. A full eval session takes ~30–45 minutes with the cache active.
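The test phase can be sketched as a nested loop over tasks and replications. Here `run_once` is a hypothetical callable standing in for the synthesis and Wiggum stages, and averaging per-task means into one composite is an assumption about how the session score is aggregated.

```python
from statistics import mean


def run_eval_suite(instruction, tasks, run_once, replications=3):
    """Score one instruction across every task, several times each.

    run_once(task, instruction) is a hypothetical callable that runs one
    task with the given instruction and returns a composite score.
    """
    per_task = {}
    for task in tasks:
        scores = [run_once(task, instruction) for _ in range(replications)]
        per_task[task] = mean(scores)
    # Session composite: unweighted mean of the per-task means (assumed).
    overall = mean(list(per_task.values()))
    return overall, per_task
```

With 5 tasks and 3 replications this calls `run_once` 15 times, matching the 15-run session described in this section.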
Evaluate/Keep: Compute the composite score delta.
KEEP_THRESHOLD = 0.1  # composite score improvement required to keep

def evaluate_and_keep(baseline_score, candidate_score, instruction_change):
    delta = candidate_score - baseline_score
    if delta > KEEP_THRESHOLD:
        log(f"KEEP: delta={delta:+.3f} - committing change")
        # git add agent.py && git commit -m "autoresearch: {change_summary}"
        commit_change(instruction_change)
        return True  # baseline updates to new score
    else:
        log(f"DISCARD: delta={delta:+.3f} - reverting")
        # git reset HEAD~1 --soft (if committed) or just restore the instruction
        revert_change()
        return False
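Git is used as the undo mechanism. Each candidate instruction is applied to agent.py as a diff. If a discarded change was already committed, git reset HEAD~1 --soft moves HEAD back to the baseline commit without touching the working tree, so the work log is not lost; the instruction in agent.py is then restored to its baseline text.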
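One way the two outcomes might map onto git commands is sketched below. The helper names are hypothetical, and the sketch assumes agent.py is the only tracked file the candidate touches. A soft reset only moves HEAD, so the file itself is restored in a second step.

```python
import subprocess


def keep_commands(path, summary):
    """Commands that record an accepted instruction change."""
    return [
        ["git", "add", path],
        ["git", "commit", "-m", f"autoresearch: {summary}"],
    ]


def discard_commands(path, committed=False):
    """Commands that undo a rejected candidate.

    If the candidate was committed, the soft reset drops that commit but
    leaves the edited file on disk; the checkout then restores the file
    (index and working tree) from the baseline commit.
    """
    cmds = []
    if committed:
        cmds.append(["git", "reset", "--soft", "HEAD~1"])
    cmds.append(["git", "checkout", "HEAD", "--", path])
    return cmds


def run_all(commands, cwd="."):
    """Run each command in order, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)
```

Building the command lists separately from running them makes the sequence easy to inspect or log before anything touches the repository.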
The Research Cache
Without the research cache, each eval run requires 5 × 3 = 15 separate search sessions — roughly 30–60 minutes of network and compression time before any synthesis happens. With the cache:
export RESEARCH_CACHE=1
python autoresearch.py
The cache stores the complete research context (post-compression, post-enrichment) for each task. Synthesis starts immediately. A 15-run eval session takes 25–35 minutes instead of 90–120 minutes.
The cache TTL is 24 hours, so autoresearch sessions that span multiple days re-fetch research once the cached entries expire, typically on the first run of each day.
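A minimal sketch of a per-task cache with a 24-hour TTL, assuming one JSON file per task on disk. The file layout, field names, and function names are illustrative assumptions; the harness's real cache format is not shown in this section.

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL


def load_cached_context(cache_dir, task_id, now=None):
    """Return the cached research context, or None if missing or expired."""
    path = Path(cache_dir) / f"{task_id}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text(encoding="utf-8"))
    now = time.time() if now is None else now
    if now - entry["saved_at"] > CACHE_TTL_SECONDS:
        return None  # stale: the caller should re-run the search stage
    return entry["context"]


def save_cached_context(cache_dir, task_id, context, now=None):
    """Store the post-compression research context for one task."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    entry = {"saved_at": time.time() if now is None else now, "context": context}
    (directory / f"{task_id}.json").write_text(json.dumps(entry), encoding="utf-8")
```

The optional `now` argument exists only so the expiry logic can be exercised without waiting a day.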
Session 1 Findings
Session 1 ran 13 experiments across two days, testing 13 candidate instruction changes against a baseline score of 8.285.
Best-performing change (exp 3, score 8.845, delta +0.560):
"Added requirement for production-ready integration examples with full agent loop usage, error handling, and real-world scenarios to address depth and specificity weaknesses."
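This is the only change from Session 1 that cleared the keep threshold (delta > 0.1), so it became the new baseline of 8.845. Later experiments are measured against that score, which is why exp 8 (8.530) was discarded even though it beats the original 8.285.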
What doesn't work:
| Experiment | Score | Change | Why it failed |
|---|---|---|---|
| exp 2 | 8.250 | Explicit tool versions + integration steps | Over-constrains format |
| exp 4 | 8.180 | Measurable outcome per section | Too mechanical |
| exp 5 | 7.935 | Expected cost improvement per technique | Off-topic for synthesis tasks |
| exp 7 | 7.615 | Complete executable code + error handling | Forces code where prose is better |
| exp 8 | 8.530 | What/Why/How/Outcome sub-structure | Helpful but too rigid |
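Replaying a few of the reported scores through the keep rule shows how the baseline shifts after exp 3. This is a sketch that assumes experiments ran in numeric order and that each keep replaces the baseline, as the comment in evaluate_and_keep suggests. The `replay` helper is hypothetical.

```python
KEEP_THRESHOLD = 0.1


def replay(baseline, experiments, threshold=KEEP_THRESHOLD):
    """Apply the keep rule in order, updating the baseline on each keep."""
    decisions = []
    for name, score in experiments:
        delta = score - baseline
        kept = delta > threshold
        decisions.append((name, round(delta, 3), kept))
        if kept:
            baseline = score  # later experiments compare against this
    return baseline, decisions


final, decisions = replay(
    8.285,
    [("exp 2", 8.250), ("exp 3", 8.845), ("exp 4", 8.180), ("exp 8", 8.530)],
)
for name, delta, kept in decisions:
    print(f"{name}: delta={delta:+.3f} {'KEEP' if kept else 'DISCARD'}")
```

Under these assumptions exp 8 is discarded with a delta of -0.315 against the updated baseline, although it would have cleared the threshold against the original 8.285.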
High variance (±1.2 across sessions) is the key diagnostic finding: the instruction framing matters more than any single element. Small changes in how requirements are phrased produce large swings in quality.
The Proposer Clustering Problem
After the initial success, the proposer clustered heavily in the "add code examples" space for 10 consecutive experiments. Each proposal was a variation of the same theme, and none cleared the keep threshold.
This resembles an optimizer stuck in a local region of the search space. The likely cause is that the proposer's training data skews toward code-centric solutions for technical writing tasks.
Escape strategies:
Research grounding: In Session 2, a gather_proposal_context() step was added. The proposer first researches prompt engineering literature, then proposes changes grounded in what it reads. This breaks the proposer out of its training distribution.
Negative constraints: Adding explicit instructions to NOT generate code-centric changes for non-enumerable tasks.
Dimension weighting: Steering proposals toward dimensions that autoresearch hasn't yet explored: output structure, comparison/trade-off framing, uncertainty acknowledgment.
def gather_proposal_context(model, planner_model):
    """Research prompt engineering literature before proposing."""
    queries = [
        "LLM synthesis instruction techniques depth specificity",
        "prompt engineering output quality improvement",
        "chain of thought few shot prompting production",
    ]
    # Run with /deep to force comprehensive search
    return gather_research(
        "Research prompt engineering improvements for AI synthesis",
        planned_queries=queries,
        max_rounds=3,
    )
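The negative-constraint and dimension-weighting strategies could be sketched in a similar spirit. The snippet below assumes each past proposal has been tagged with a theme label; that bookkeeping, the `steering_note` helper, and the `avoid_after` cutoff are hypothetical and not part of the harness as described.

```python
from collections import Counter

CANDIDATE_DIMENSIONS = [
    "code examples",
    "output structure",
    "comparison/trade-off framing",
    "uncertainty acknowledgment",
]


def steering_note(history, dimensions=CANDIDATE_DIMENSIONS, avoid_after=3):
    """Build extra proposer guidance from the themes of past proposals.

    history is a list of theme labels, one per past experiment.
    """
    counts = Counter(history)
    # Least-explored dimension first; sorted() is stable, so ties keep
    # the order given in `dimensions`.
    ranked = sorted(dimensions, key=lambda d: counts[d])
    lines = [f"Focus this proposal on: {ranked[0]}."]
    # Negative constraints for themes that have been tried repeatedly.
    for dim in dimensions:
        if counts[dim] >= avoid_after:
            lines.append(f"Do NOT propose another change about {dim}.")
    return "\n".join(lines)
```

The returned text would be appended to the proposer prompt, so a run of ten code-centric proposals yields both a pointer to an unexplored dimension and an explicit ban on the overused one.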