Harness Engineering for AI Agents · Production Systems

Cost & Inference Management

12 min read

By the end of this reading you will be able to:

Calculate the token cost of a multi-stage run from a runs.jsonl entry and identify which stage consumes the most tokens
Explain the Ollama keep-alive model hot-loading strategy and quantify the latency savings for consecutive runs on the same model
Describe the vLLM AWQ quantization approach and explain the memory-quality tradeoff compared to full-precision inference
Describe the inference shim's context-length retry logic (60%/20% truncation) and explain why thinking-model output is suppressed for synthesis tasks

Why Cost Management Matters for Local Models

The harness runs on local hardware with no per-token billing — but cost is real. The cost is measured in:

Latency: a single run with the 32B model takes 3–8 minutes depending on output length and Wiggum rounds
VRAM: running multiple models simultaneously or models larger than VRAM capacity causes swapping that multiplies latency
Power: on a laptop, a 4-hour autoresearch session is meaningful energy consumption

Cost management for local models is primarily about latency minimization and VRAM allocation — not dollar billing.

Token Accounting by Stage

Every run logs token consumption by stage:

"tokens_by_stage": {
  "memory_retrieval":    0,      // embedding only, no generation
  "planning":            1240,   // planner model call
  "research_compression": 3820,  // compress_knowledge() × N rounds
  "search_query_gen":    420,    // plan_query() × N rounds beyond 2
  "synthesis":           8450,   // producer model, main call
  "count_check_retry":   0,      // 0 if count was correct first time
  "wiggum_r1":          2100,   // evaluator call
  "wiggum_r2":          1980,   // revision + re-evaluation
  "memory_compression": 890      // compress_and_store()
}

Typical run breakdown by percentage of total tokens:

Stage	~% of total
Synthesis	45–55%
Research compression	15–25%
Wiggum (all rounds)	15–25%
Planning + memory	5–10%

Synthesis dominates — but it is also the least reducible because it is the output-producing stage. Research compression and Wiggum are the most reducible: using a smaller compress model (glm4:9b instead of the producer) and avoiding unnecessary Wiggum rounds.

Token Cost Reduction Strategies

Use a smaller compress model:

export COMPRESS_MODEL=qwen2.5:7b   # instead of the producer

The compression task ("summarize these search results into 5-8 bullets") does not require the producer's full capability. A 7B model handles it well and costs ~4x less latency.

Use a smaller planner:

export PLANNER_MODEL=glm4:9b

Already the default, but worth making explicit. The planning task is structured and low-creativity — a fast 9B model is appropriate.

Use the research cache:

export RESEARCH_CACHE=1

Caches the complete research context (not just individual search results) for up to 24 hours. Subsequent runs on the same task skip the entire gather_research() stage. Essential for autoresearch sessions that run the same 5 tasks dozens of times.

Reduce Wiggum rounds: Invest in better synthesis instructions (the autoresearch objective) to improve first-pass scores above threshold, eliminating revision rounds entirely for most runs.

Model Hot-Loading with Ollama Keep-Alive

When a model is loaded for the first time in a run, Ollama loads its weights from disk into VRAM — this takes 20–60 seconds for large models. Subsequent calls to the same model within the keep-alive window skip loading:

# Set a long keep-alive to avoid reloading between calls in the same run:
response = ollama.chat(
    model="pi-qwen-32b",
    messages=messages,
    options={"keep_alive": "10m"}  # keep weights in VRAM for 10 minutes
)

For a run with synthesis + 2 Wiggum revision rounds (3 producer calls total), keep-alive saves 40–180 seconds of model loading. For autoresearch sessions with dozens of consecutive runs, the savings compound significantly.

The tradeoff: keeping a 20 GB model in VRAM blocks VRAM for other processes. Set keep_alive to the approximate duration between producer calls within a session (typically 2–5 minutes).

vLLM with AWQ Quantization

For systems where the 32B model exceeds VRAM capacity, vLLM with AWQ int4 quantization offers a meaningful alternative:

Configuration	Model Size	VRAM Required	Quality vs. Full
Qwen2.5-32B Q4_K_M (Ollama)	20 GB	20+ GB	100% (reference)
Qwen2.5-14B-AWQ (vLLM)	~9.4 GB	~10 GB	~90–95%
Qwen2.5-7B Q4 (Ollama)	~5 GB	~6 GB	~75–80%

AWQ (Activation-aware Weight Quantization) is a post-training quantization method that identifies and preserves the weights most important for output quality. The result is a 4-bit model that degrades less than naive int4 quantization.

# vLLM serve in WSL2 Ubuntu:
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --dtype half \
  --quantization awq_marlin \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90

--enable-prefix-caching is important for autoresearch: when the same system prompt appears at the start of every call, vLLM caches the KV entries for the prefix, saving prefill computation on every subsequent call that shares the prefix.

Context-Length Retry

Long synthesis contexts — research outputs combined with task descriptions and memory injections — occasionally exceed the model's configured context window. Rather than failing hard, the inference shim retries up to two times with truncated context:

_CONTEXT_ERROR_STRINGS = (
    "maximum context length",
    "exceeds the available context size",
    "context length exceeded",
)

def _truncate_for_retry(messages: list[dict]) -> None:
    """Truncate the longest non-system message: keep head (60%) + tail (20%)."""
    longest = max(
        (m for m in messages if m.get("role") != "system"),
        key=lambda m: len(m.get("content", ""))
    )
    content = longest["content"]
    keep_head = int(len(content) * 0.60)
    keep_tail = int(len(content) * 0.20)
    longest["content"] = (
        content[:keep_head]
        + "\n\n[... context truncated ...]\n\n"
        + content[-keep_tail:]
    )

for attempt in range(3):
    try:
        return _stream_vllm_call(client, vllm_model, messages, oai_kwargs)
    except Exception as exc:
        if not any(s in str(exc) for s in _CONTEXT_ERROR_STRINGS):
            raise
        if attempt == 2:
            raise
        _truncate_for_retry(messages)

The 60%/20% split is deliberate: the head contains the task description and instruction preamble (highest signal), the tail contains the most recent research or prior synthesis (second-highest signal), and the middle body — where redundant search results accumulate — is dropped. The system prompt is never truncated.

Context-length errors are more common with vLLM than Ollama because vLLM's --max-model-len is set conservatively to limit memory pressure. If retries happen frequently for a task type, the right fix is to reduce the research context size upstream, not to rely on truncation.

Thinking-Model Suppression

Some models (Qwen3, DeepSeek-R1) include a reasoning trace (<think>...</think>) before their final response. For synthesis tasks, the reasoning trace is pure overhead: it consumes tokens and latency but the student output is the only artifact the harness needs.

Two suppression mechanisms work together:

At call time — if the model is identified as a thinking model, the think option is set to False:

def _is_thinking_model(model_name: str) -> bool:
    return any(name in model_name.lower() for name in ("qwen3", "r1", "deepseek-r1"))

# In agent.py, before producer call:
if _is_thinking_model(producer_model):
    opts["think"] = False

This suppresses the reasoning trace at the model level — most thinking models respect this flag and produce direct output. The call is faster and uses fewer tokens.

At parse time — as a fallback, the response parser strips any <think> block that slipped through:

class _OllamaResponse:
    def __init__(self, oai_message):
        raw = getattr(oai_message, "content", "") or ""
        reasoning = getattr(oai_message, "reasoning_content", None) or ""
        if not reasoning:
            m = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
            if m:
                reasoning = m.group(1).strip()
                raw = raw[raw.rfind("</think>") + len("</think>"):].strip()
        self.thinking = reasoning  # stored in trace for debugging
        self.content  = raw        # what callers receive

The thinking trace is stored on the response object (accessible for debugging via the run trace) but is not returned to callers. This means the synthesis stage never sees reasoning tokens and cannot accidentally include them in output.

The Inference Backend Shim Cost

The shim (inference.py) adds negligible overhead — one dictionary lookup per call to resolve the model name to a backend-specific identifier. The benefit is that all token accounting, logging, and error handling is centralized in the shim rather than scattered across callers:

class OllamaLike:
    def chat(self, model, messages, options=None, stream=False):
        with trace.span(f"llm:{model}"):
            if self.backend == "vllm":
                response = self._vllm_chat(model, messages, options)
            else:
                response = ollama.chat(model=model, messages=messages,
                                        options=options, stream=stream)
            # Centralized token logging — all callers get this automatically
            trace.log_usage(response, stage=self._current_stage)
            return response

References

AWQ — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Previous Next →

Cost & Inference Management

Why Cost Management Matters for Local Models

Token Accounting by Stage

Token Cost Reduction Strategies

Model Hot-Loading with Ollama Keep-Alive

vLLM with AWQ Quantization

Context-Length Retry

Thinking-Model Suppression

The Inference Backend Shim Cost

Privacy Policy

What we collect

What we don't collect

Your choices

Contact