Harness Engineering for AI Agents · Context Engineering & Memory

Saturation Gating

By the end of this reading you will be able to:
  • Implement a saturation-gated search loop using heuristic novelty assessment, a rolling knowledge state, and a configurable novelty threshold
  • Compare heuristic (word-overlap) and model-based novelty assessment, explaining the latency and accuracy tradeoffs of each
  • Explain how compress_knowledge() maintains a rolling summary and why it is only called for accepted search rounds
  • Explain the purpose of NOVELTY_EPSILON in the saturation gate and describe the failure mode it prevents when all queries return below-threshold novelty scores

The Design Problem

The fixed dual-search loop runs exactly two searches regardless of topic complexity. Saturation gating replaces the fixed loop with a stopping criterion: keep searching while new results are genuinely new; stop when they start repeating.

This is the same principle that guides a competent human researcher: you stop when your search results start recapitulating what you already know.

Configuration

MAX_SEARCH_ROUNDS   = 5     # hard cap regardless of novelty
NOVELTY_THRESHOLD   = 3     # 0–10; stop if new results score below this
KNOWLEDGE_MAX_CHARS = 1500  # cap on rolling knowledge state fed to novelty prompt
NOVELTY_EPSILON     = 0.15   # ε-greedy: pass sub-threshold results through 15% of the time

The existing SEARCHES_PER_TASK = 2 becomes a minimum: the loop always accepts at least 2 rounds before novelty gating kicks in. This preserves backward compatibility: simple tasks typically keep 2 rounds (a third search still runs, but the gate usually rejects it), while complex tasks run up to 5.

The Saturation Loop

import random  # used by the ε-greedy gate below

def gather_research(task, planned_queries, ...):
    knowledge_state = ""    # rolling compressed summary
    all_results = []        # deduplicated raw results

    for round in range(1, MAX_SEARCH_ROUNDS + 1):
        # Generate query: use planned_queries for rounds 1-2, then plan_query()
        if round <= len(planned_queries):
            query = planned_queries[round - 1]
        else:
            # `model` is assumed to be one of the parameters elided in the signature
            query = plan_query(task, knowledge_state, round, model)

        results = web_search_cached(query)
        novelty = assess_novelty(results, knowledge_state)
        log(f"[search {round}] novelty={novelty} query={query}")

        # Gate: stop if saturation reached (after minimum rounds)
        if novelty < NOVELTY_THRESHOLD and round > SEARCHES_PER_TASK:
            if random.random() < NOVELTY_EPSILON:
                log("  [novelty] saturation but ε-greedy pass-through — continuing")
            else:
                log("  [novelty] saturation — stopping search")
                break

        # Accept this round
        all_results = merge_deduplicated(all_results, results)
        # compress_knowledge() needs the model handle (assumed to be among the elided parameters)
        knowledge_state = compress_knowledge(knowledge_state, results, model)

    # URL enrichment — only fetch URLs not already covered by knowledge_state
    enriched = enrich_novel_urls(all_results, knowledge_state)
    return format_results(all_results) + enriched

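The key invariant: compress_knowledge() is called only for accepted rounds, meaning rounds that pass the novelty gate or are let through by the ε-greedy pass-through. A rejected round triggers no compression call and no state update, so no compute is spent summarising low-value results. (With model-based novelty assessment, the assessment itself still costs one model call per round.)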
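The gate can be exercised without a network or a model. The sketch below is a simplified stand-in rather than the harness itself: fake_search, the canned CORPUS, and the append-only state update are all hypothetical, and the ε-greedy pass-through is left out so the run is deterministic.

```python
MAX_SEARCH_ROUNDS = 5
SEARCHES_PER_TASK = 2
NOVELTY_THRESHOLD = 3

# Hypothetical canned results: rounds 1-2 bring new words, round 3 repeats round 1.
CORPUS = {
    1: "saturation gating stops search loops early",
    2: "novelty threshold controls when results repeat",
    3: "saturation gating stops search loops early",
    4: "never reached in this run",
}

def fake_search(round_num):
    """Stand-in for web_search_cached(): one canned result per round."""
    return [{"body": CORPUS[round_num]}]

def word_novelty(results, knowledge_state):
    """Word-overlap novelty on a 0-10 scale, as in the heuristic assessor."""
    new_words = {w for r in results for w in r["body"].lower().split()}
    known = set(knowledge_state.lower().split())
    if not new_words:
        return 0
    return round(10 * len(new_words - known) / len(new_words))

def run_loop():
    knowledge_state = ""
    accepted = []
    novelty = None
    for round_num in range(1, MAX_SEARCH_ROUNDS + 1):
        results = fake_search(round_num)
        novelty = word_novelty(results, knowledge_state)
        if novelty < NOVELTY_THRESHOLD and round_num > SEARCHES_PER_TASK:
            break  # saturation: reject this round and stop
        accepted.append(round_num)
        # Stand-in for compress_knowledge(): just append the new text
        knowledge_state += " " + results[0]["body"]
    return accepted, novelty

accepted_rounds, last_score = run_loop()
print(accepted_rounds, last_score)
```

Rounds 1 and 2 are accepted because every word is new. Round 3 repeats round 1, scores 0, and stops the loop before its results touch the state.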

Epsilon-Greedy Pass-Through

The saturation gate has a failure mode: in sessions where every query happens to return below-threshold novelty — due to caching, query convergence, or a topic that is genuinely exhausted — the loop terminates at the minimum two rounds even for complex tasks.

NOVELTY_EPSILON = 0.15 adds a random escape valve. Fifteen percent of the time, a below-threshold round is accepted anyway and the loop continues:

if novelty < NOVELTY_THRESHOLD and round > SEARCHES_PER_TASK:
    if random.random() < NOVELTY_EPSILON:
        log("  [novelty] saturation but ε-greedy pass-through — continuing")
    else:
        log("  [novelty] saturation — stopping search")
        break

The pass-through does not reset the threshold — if the next round also scores below threshold, it is again subject to the 15% gate. The effect is that no single low-novelty result can terminate the loop with certainty; a sustained sequence of low-novelty results terminates it with high probability.
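The "high probability" claim can be made concrete. Assuming each below-threshold round is an independent draw against NOVELTY_EPSILON (which is what random.random() provides), the chance that k consecutive low-novelty rounds are all passed through is ε^k. The helper name below is illustrative, not part of the harness.

```python
NOVELTY_EPSILON = 0.15

def survival_probability(k, epsilon=NOVELTY_EPSILON):
    """Probability that k consecutive below-threshold rounds are all passed through."""
    return epsilon ** k

for k in (1, 2, 3):
    print(f"{k} low-novelty round(s): continue with p = {survival_probability(k):.4f}")
```

With ε = 0.15, two consecutive low-novelty rounds are both passed through only about 2% of the time (0.15² = 0.0225), so a genuinely saturated session rarely runs much past the minimum.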

The name comes from ε-greedy exploration in reinforcement learning: exploit the stopping condition 85% of the time (stop when results are stale), explore past it 15% of the time (continue in case the heuristic underestimated novelty). The analogy holds — the harness is balancing exploitation against exploration on every search round.

Novelty Assessment: Heuristic

def assess_novelty_heuristic(new_results: list[dict], knowledge_state: str) -> int:
    new_words = set(
        w for r in new_results
        for w in r.get("body", "").lower().split()
    )
    known_words = set(knowledge_state.lower().split())
    if not new_words:
        return 0
    novel_fraction = len(new_words - known_words) / len(new_words)
    return round(novel_fraction * 10)  # 0–10

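Word-level set difference: what fraction of the distinct words in the new results have not appeared in the knowledge state? A score of 3 means roughly 30% of those words are new and the rest are repetitions. Once the minimum rounds have run, a round scoring below the threshold of 3 is rejected unless the ε-greedy pass-through lets it through.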

Pros: ~0ms, no model call, fully deterministic, no latency impact on autoresearch sessions.

Cons: Vocabulary overlap is a weak proxy for semantic novelty. A result that paraphrases everything in the knowledge state using different words will score high even though it adds no new information.
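The paraphrase weakness is easy to reproduce. The sketch below restates the heuristic compactly so it runs on its own; the knowledge state and the two result bodies are invented for illustration.

```python
def assess_novelty_heuristic(new_results, knowledge_state):
    """Share of distinct result words absent from the knowledge state, scaled to 0-10."""
    new_words = {w for r in new_results for w in r.get("body", "").lower().split()}
    known_words = set(knowledge_state.lower().split())
    if not new_words:
        return 0
    return round(len(new_words - known_words) / len(new_words) * 10)

known = "the cache stores recent search results to avoid repeated network calls"

# Same facts, same words
repeat = [{"body": "the cache stores recent search results"}]
# Same facts, different vocabulary
paraphrase = [{"body": "lookups are memoised so identical queries skip the web round-trip"}]

repeat_score = assess_novelty_heuristic(repeat, known)
paraphrase_score = assess_novelty_heuristic(paraphrase, known)
print(repeat_score, paraphrase_score)
```

The verbatim repeat scores 0, while the paraphrase scores 9 even though it conveys the same fact: only "the" overlaps with the knowledge state.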

Novelty Assessment: Model-Based

NOVELTY_PROMPT = """\
What is already known:
{knowledge_state}

New search results:
{new_results}

Do these results add genuinely new information not already covered above?
Score 0–10 where 0 = completely redundant, 10 = entirely new information.
Output ONLY the integer score, nothing else."""

import re

import ollama

def assess_novelty_model(new_results, knowledge_state, model) -> int:
    # Truncate both inputs so the prompt stays short and prefill stays cheap
    snippet = format_results(new_results)[:800]
    prompt  = NOVELTY_PROMPT.format(
        knowledge_state=knowledge_state[:800],
        new_results=snippet,
    )
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0, "num_predict": 3}  # 1-2 output tokens only
    )
    raw = response["message"]["content"].strip()
    match = re.search(r'\d+', raw)
    if not match:
        return 5  # default neutral on parse failure
    return min(10, int(match.group()))  # clamp stray values such as "100" to the 0-10 scale

Pros: Semantic understanding — catches paraphrase duplicates, topic drift, and structural repetition.

Cons: Adds ~10–15 seconds per round (prefill-dominated; only 1–3 output tokens needed). On autoresearch sessions with many consecutive runs, this latency compounds.

Recommendation: Start with heuristic novelty. Switch to model-based assessment if the heuristic proves noisy (e.g. it accepts rounds that produce structurally duplicate content).
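One way to make that switch a configuration change rather than a code change is a small dispatcher. This is a sketch under assumptions: make_assess_novelty, the mode strings, and the scorer-injection signature are hypothetical, and the lambdas stand in for the real scorers so the example runs without a model.

```python
from typing import Callable

Scorer = Callable[[list, str], int]

def make_assess_novelty(heuristic: Scorer, model_based: Scorer,
                        mode: str = "heuristic") -> Scorer:
    """Return the scorer selected by `mode`, with its output clamped to 0-10."""
    scorers = {"heuristic": heuristic, "model": model_based}
    if mode not in scorers:
        raise ValueError(f"unknown novelty mode: {mode!r}")
    chosen = scorers[mode]

    def assess_novelty(new_results: list, knowledge_state: str) -> int:
        # Clamp so a misbehaving scorer cannot push values outside the scale
        return max(0, min(10, chosen(new_results, knowledge_state)))

    return assess_novelty

# Stand-in scorers: the "model" one deliberately returns an out-of-range value
assess = make_assess_novelty(lambda r, k: 2, lambda r, k: 42, mode="model")
print(assess([], ""))
```

The search loop then calls assess_novelty(results, knowledge_state) with the same two arguments whichever scorer is behind it.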

Knowledge Compression

COMPRESS_PROMPT = """\
Current knowledge summary:
{current_state}

New search results to incorporate:
{new_results}

Update the summary to include the new information. Be concise — 5-8 bullet points,
each starting with a key fact. Do not exceed {max_chars} characters total.
Output ONLY the bullet points, nothing else."""

def compress_knowledge(current_state, new_results, model, max_chars=KNOWLEDGE_MAX_CHARS):
    if not current_state:
        # First round: build initial state directly from results (no model call)
        bodies = " ".join(r.get("body", "") for r in new_results)[:1200]
        return bodies[:max_chars]

    prompt = COMPRESS_PROMPT.format(
        current_state=current_state,
        new_results=format_results(new_results)[:800],
        max_chars=max_chars
    )
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}],
                           options={"temperature": 0.1, "num_predict": 400})
    return response["message"]["content"].strip()[:max_chars]

The knowledge state is a rolling summary of everything gathered so far, capped at 1,500 characters (KNOWLEDGE_MAX_CHARS) and, after the first round, formatted as bullet points. It serves two purposes:

  1. Input to novelty assessment: new results are compared against it
  2. Input to adaptive query generation: plan_query() uses it to identify gaps

The first round skips the model call and builds the state directly from the raw result bodies (fast path). Subsequent rounds compress incrementally: the model is asked to add new facts and remove redundancies within the character cap, and the final [:max_chars] slice enforces the cap even if the model overshoots.
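A hard character slice can cut the last bullet in half. If that matters, one option is to cut at the last complete line instead. The helper below is a hypothetical refinement, not part of the code above.

```python
def truncate_at_line(text: str, max_chars: int) -> str:
    """Cut `text` to at most max_chars, dropping a trailing partial line if possible."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    if text[max_chars] == "\n":
        # The cut landed exactly on a line boundary, so nothing is partial
        return clipped
    last_newline = clipped.rfind("\n")
    # No usable newline inside the window: fall back to a hard cut
    return clipped[:last_newline] if last_newline > 0 else clipped

summary = "- fact one\n- fact two\n- fact three is long"
print(truncate_at_line(summary, 30))
```

Here the 30-character window ends partway through the third bullet, so only the first two bullets are kept.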

Adaptive Query Generation

Once the pre-planned queries are used up (round 3 onward with the default two), plan_query() generates a gap-filling query instead:

PLAN_QUERY_PROMPT = """\
Task: {task}

What is already known:
{knowledge_state}

Generate ONE search query to find important information about the task NOT yet covered
above. Output ONLY the query string, nothing else."""

def plan_query(task, knowledge_state, round, model, planned_queries=()):
    # Early rounds reuse a pre-planned query when the caller supplies them
    if round <= len(planned_queries):
        return planned_queries[round - 1]
    # Nothing known yet: there are no gaps to target, so fall back to the task text
    if not knowledge_state:
        return task

    response = ollama.chat(model=model,
                           messages=[{"role": "user", "content":
                               PLAN_QUERY_PROMPT.format(task=task,
                                                        knowledge_state=knowledge_state)}],
                           options={"temperature": 0.3, "num_predict": 60})
    return response["message"]["content"].strip().strip('"')

By round 3, the model knows what the first two queries covered and can generate a query specifically targeting the uncovered territory. This is the mechanism that allows complex topics to get 4–5 rounds of genuinely diverse search coverage.

URL Enrichment with Novelty Gating

def enrich_novel_urls(results, knowledge_state, count=URL_ENRICH_COUNT) -> str:
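    blocks = []
    fetched = 0
    # The knowledge state is fixed for the whole call, so tokenise it once
    known_words = set(knowledge_state.lower().split())
    for r in results:
        if fetched >= count:
            break
        snippet = r.get("body", "")
        snippet_words = set(snippet.lower().split())
        # Fraction of the snippet's distinct words already present in the knowledge state
        overlap = len(snippet_words & known_words) / max(len(snippet_words), 1)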
        if overlap > 0.6:
            log(f"  [enrich] skipping {r['href'][:50]} — {overlap:.0%} overlap")
            continue
        content = fetch_url_content(r["href"])
        if content:
            blocks.append(f"**Full page: {r.get('title','')}**\n{r['href']}\n\n{content}")
            fetched += 1
    return "\n\n---\n\n".join(blocks)

URL enrichment is expensive: fetching and converting a full page takes 30–60 seconds. The novelty gate skips any URL whose snippet shares more than 60% of its distinct words with the knowledge state, so only URLs whose snippets look new get full-page treatment, up to URL_ENRICH_COUNT pages.
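The overlap test can be checked in isolation. In this sketch the snippet_overlap helper and the sample strings are hypothetical; the 0.6 cutoff mirrors the one used in enrich_novel_urls().

```python
def snippet_overlap(snippet: str, knowledge_state: str) -> float:
    """Fraction of the snippet's distinct words already in the knowledge state."""
    snippet_words = set(snippet.lower().split())
    known_words = set(knowledge_state.lower().split())
    return len(snippet_words & known_words) / max(len(snippet_words), 1)

knowledge = "saturation gating stops the search loop when results repeat"
covered = "saturation gating stops the loop"                    # every word already known
fresh = "epsilon greedy pass-through adds random exploration"   # no word known

for s in (covered, fresh):
    ov = snippet_overlap(s, knowledge)
    print(f"{ov:.0%} overlap -> {'skip' if ov > 0.6 else 'fetch'}")
```

An empty snippet yields 0% overlap and is therefore fetched, which is worth knowing if some search results arrive without a body.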