Model Roles & Separation
- Explain the circular evaluation problem and why a model cannot reliably score its own output
- Describe each model role in the harness (producer, evaluator, planner, vision) and the constraints on model selection for each
- Configure the inference backend shim to route model calls to Ollama or vLLM using environment variables
The Circular Evaluation Problem
Design principle 5 states: evaluator and producer must be different models. This is not aesthetic preference — it is a correctness requirement.
When a model evaluates its own output, it reproduces the same reasoning and the same systematic biases it used to produce the output. If the model tends to overcount sections, it will also overcount when verifying them. If the model's training led it to prefer confident-sounding prose over concrete implementation detail, it will score confident-sounding prose highly regardless of whether implementation detail is present.
External evaluation, by a model from a different family or a significantly larger model, breaks this circularity. The evaluator applies genuinely independent judgment. The revision feedback it produces is more likely to identify the actual problems rather than rationalize existing choices.
This is the same reason academic peer review uses external reviewers, and the same reason code review is valuable even when the author is competent.
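The separation rule can be enforced at startup rather than left to convention. The following is a minimal sketch, not the harness's actual code: `check_role_separation` is a hypothetical helper, and it only catches identical model names. It cannot detect two different names that alias the same weights.

```python
def check_role_separation(producer: str, evaluator: str) -> None:
    """Raise if the producer and evaluator roles resolve to the same model name.

    Hypothetical guard: compares names only (case- and whitespace-insensitive).
    Aliases that point at the same underlying weights are NOT detected.
    """
    if producer.strip().lower() == evaluator.strip().lower():
        raise ValueError(
            f"Evaluator and producer must be different models, got {producer!r} for both"
        )


# Passes silently: the default producer and evaluator differ.
check_role_separation("pi-qwen-32b", "Qwen3-Coder:30b")
```

Calling a guard like this once, before the first synthesis round, turns a silent scoring problem into an immediate configuration error.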
Model Roles
The harness assigns different models to different roles based on their required capabilities and the latency budget for each stage:
| Role | Model | Why |
|---|---|---|
| Producer (default) | pi-qwen-32b (Qwen2.5-32B Q4_K_M) | Largest available model for maximum depth; custom Modelfile with optimized system prompt |
| Producer (fallback) | pi-qwen (qwen2.5:7b) | ~3× faster; use on 16 GB RAM systems or for fast iteration |
| Evaluator | Qwen3-Coder:30b | Must be larger than producer or from a different family; coding background improves specificity scoring |
| Planner / Compressor | glm4:9b | Fast enough for per-round compression; different architecture from producer prevents prompt echo |
| Annotator | nanda-annotator | QLoRA fine-tuned Qwen2.5-7B on domain-specific annotations; produced in Module 5 |
| Vision | llama3.2-vision | Image-to-text preprocessing only; does not synthesize or evaluate |
| GitHub skill | llama3.2:3b | Commit message and PR generation only; fast, low-stakes |
The Custom Modelfile
Ollama Modelfiles define a model's system prompt, temperature, context window, and other parameters. The producer model uses a custom Modelfile that sets:
FROM qwen2.5:32b-instruct-q4_K_M
SYSTEM """
You are a research and synthesis assistant. Your task is to produce comprehensive,
well-structured Markdown documents based on the provided research context.
Always:
- Begin output with a top-level Markdown heading (# Title)
- Use concrete examples, named tools, and specific implementation steps
- Provide implementation notes for each technique or concept covered
- Output ONLY valid Markdown — no preamble, no commentary about your approach
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
The system prompt is where synthesis instruction engineering lives — not in the per-request user prompt. This keeps the synthesis prompt clean and makes the system instruction the sole target of autoresearch optimization.
The Inference Backend Shim
inference.py provides a unified interface that routes model calls to either Ollama (default) or vLLM, using the same call signature throughout the codebase:
from inference import OllamaLike

client = OllamaLike()  # reads INFERENCE_BACKEND env var

response = client.chat(
    model="pi-qwen-32b",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.3, "num_predict": 4096},
)

content = response["message"]["content"]
tokens = response["usage"]["total_tokens"]
When INFERENCE_BACKEND=vllm, the shim remaps model names via VLLM_MODEL_MAP and translates vLLM's response into the Ollama-style shape that the rest of the harness expects. The harness code never checks which backend is running.
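The translation step can be pictured as two small pure functions. This is a sketch under stated assumptions, not the contents of inference.py: it assumes vLLM is reached through its OpenAI-compatible chat completions endpoint, and the names `to_openai_params` and `to_harness_response` are hypothetical.

```python
from typing import Any


def to_openai_params(options: dict[str, Any]) -> dict[str, Any]:
    """Map Ollama-style options onto OpenAI-style request parameters.

    Only the two options used in this module are handled; others are dropped.
    """
    params: dict[str, Any] = {}
    if "temperature" in options:
        params["temperature"] = options["temperature"]
    if "num_predict" in options:
        # Ollama's num_predict caps generated tokens, like OpenAI's max_tokens.
        params["max_tokens"] = options["num_predict"]
    return params


def to_harness_response(openai_resp: dict[str, Any]) -> dict[str, Any]:
    """Reshape an OpenAI-style chat completion into the shape the harness reads."""
    message = openai_resp["choices"][0]["message"]
    usage = openai_resp.get("usage") or {}
    return {
        "message": {
            "role": message.get("role", "assistant"),
            "content": message.get("content") or "",
        },
        "usage": {"total_tokens": usage.get("total_tokens", 0)},
    }
```

With this shape, the caller's `response["message"]["content"]` and `response["usage"]["total_tokens"]` lookups work unchanged on either backend.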
vLLM Setup (GPU-Limited Systems)
For systems where the 32B model is too large to run under Ollama (it requires ~20 GB of unified RAM+VRAM), serve the smaller Qwen2.5-14B-Instruct model under vLLM with AWQ int4 quantization instead, which brings the memory footprint down to ~9.4 GB:
# In WSL2 Ubuntu:
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
--dtype half --quantization awq_marlin \
--max-model-len 8192 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90
Then in .env:
INFERENCE_BACKEND=vllm
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_MODEL_MAP={"pi-qwen-32b":"Qwen/Qwen2.5-14B-Instruct-AWQ"}
The model map is how a single physical model serves multiple logical roles — pi-qwen-32b in the harness code becomes Qwen/Qwen2.5-14B-Instruct-AWQ on the vLLM server. The harness never sees the underlying model name.
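The remapping described above might look like the following sketch. `resolve_model` is a hypothetical name, and falling back to the logical name when no mapping exists is an assumption; the real shim may raise instead.

```python
import json
import os
from typing import Mapping


def resolve_model(logical_name: str, env: Mapping[str, str] = os.environ) -> str:
    """Translate a logical harness model name into the name the backend serves."""
    # Ollama is the default backend and uses the logical names directly.
    if env.get("INFERENCE_BACKEND", "ollama").lower() != "vllm":
        return logical_name
    # VLLM_MODEL_MAP holds a JSON object: {"logical-name": "served-name", ...}
    model_map = json.loads(env.get("VLLM_MODEL_MAP", "{}"))
    # Assumption: unmapped names pass through unchanged.
    return model_map.get(logical_name, logical_name)


example_env = {
    "INFERENCE_BACKEND": "vllm",
    "VLLM_MODEL_MAP": '{"pi-qwen-32b":"Qwen/Qwen2.5-14B-Instruct-AWQ"}',
}
print(resolve_model("pi-qwen-32b", example_env))
```

Passing the environment in as a parameter keeps the function easy to test without mutating the process environment.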
Configuration via Environment Variables
All model role assignments are configurable without code changes:
# Override producer
export HARNESS_PRODUCER_MODEL=mistral-small3.1:24b
# Override evaluator
export WIGGUM_EVALUATOR_MODEL=phi4:14b
# Override planner/compressor
export PLANNER_MODEL=qwen2.5:7b
export COMPRESS_MODEL=qwen2.5:7b # can differ from planner
# Switch inference backend
export INFERENCE_BACKEND=vllm
This is what "build for deletion" looks like in practice: any model in any role can be replaced with a single environment variable change. The harness does not hardcode model names in business logic.
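One way to keep model names out of business logic is a single lookup table from role to environment variable and default. The sketch below is illustrative only: the variable names and defaults are taken from this module, while `ROLE_ENV` and `model_for` are hypothetical, as is the choice of glm4:9b as the compressor default.

```python
import os
from typing import Mapping

# role -> (environment variable, default model from the roles table)
ROLE_ENV = {
    "producer": ("HARNESS_PRODUCER_MODEL", "pi-qwen-32b"),
    "evaluator": ("WIGGUM_EVALUATOR_MODEL", "Qwen3-Coder:30b"),
    "planner": ("PLANNER_MODEL", "glm4:9b"),
    "compressor": ("COMPRESS_MODEL", "glm4:9b"),  # assumed default
}


def model_for(role: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the model assigned to a role, preferring the env override."""
    var, default = ROLE_ENV[role]
    # An unset or empty variable falls back to the default.
    return env.get(var) or default


print(model_for("producer", {"HARNESS_PRODUCER_MODEL": "mistral-small3.1:24b"}))
```

Every call site asks for a role, never a model name, so replacing a model touches only the environment.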