Harness Engineering for AI Agents · Context Engineering & Memory

The Dual-Search Foundation

By the end of this reading you will be able to:
  • Describe the fixed dual-search loop and the quality-floor fallback mechanism, including the conditions that trigger each
  • Explain how the 24-hour SQLite TTL cache reduces search latency and what tradeoffs it introduces
  • Identify the failure modes of a fixed N-search loop (over-search and under-search) and explain why they motivated saturation gating

The Original Research Loop

The harness began with a fixed dual-search architecture: run exactly two web searches per task, merge the results, pass them to synthesis. Simple, deterministic, reproducible.

This is the version validated in Experiment 1. The 9-run completely randomized design (CRD) across three task types used the dual-search loop as a controlled variable: holding search count constant at two isolated the effect of task type on output quality.

How the Dual-Search Loop Works

SEARCHES_PER_TASK = 2
SEARCH_QUALITY_FLOOR = 1800  # minimum characters of results to proceed

def gather_research(task, planned_queries):
    all_results = []

    for i, query in enumerate(planned_queries[:SEARCHES_PER_TASK]):
        results = web_search_cached(query)

        # Quality floor: if results are thin, try a fallback query
        if total_chars(results) < SEARCH_QUALITY_FLOOR and i == 0:
            fallback = simplify_query(query)
            results = web_search_cached(fallback)
            log(f"[quality floor] fallback triggered: {fallback}")

        all_results.extend(results)

    return format_results(deduplicate(all_results))

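Two searches per task. The first query comes from the planner's search_queries[0], the most targeted query for the task. The second comes from search_queries[1], a complementary angle. The planner's list reaches gather_research as planned_queries, and the slice caps it at SEARCHES_PER_TASK; if the planner supplies only one query, only one search runs.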
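The listing above calls total_chars and deduplicate without showing them. The following is a minimal sketch of what they might look like, assuming results are dicts with title, href, and body keys; counting title plus body characters and deduplicating by normalized URL are illustrative assumptions, not the harness's confirmed behavior.

```python
from typing import Iterable


def total_chars(results: Iterable[dict]) -> int:
    """Sum title and body lengths across results (assumed adequacy metric)."""
    return sum(len(r.get("title", "")) + len(r.get("body", "")) for r in results)


def deduplicate(results: list[dict]) -> list[dict]:
    """Keep the first occurrence of each URL, preserving order (assumed policy)."""
    seen: set[str] = set()
    unique = []
    for r in results:
        # Normalize lightly so a trailing slash or letter case does not hide a duplicate
        url = r.get("href", "").rstrip("/").lower()
        if url and url in seen:
            continue
        seen.add(url)
        unique.append(r)
    return unique


sample = [
    {"title": "A", "href": "https://x.com/a", "body": "hello"},
    {"title": "A2", "href": "https://x.com/a/", "body": "dup"},
    {"title": "B", "href": "https://x.com/b", "body": "world"},
]
kept_titles = [r["title"] for r in deduplicate(sample)]
```

Deduplicating by URL rather than by snippet text keeps the check cheap, at the cost of missing the same content served from different addresses.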

The Quality Floor

The quality floor is a minimum-adequacy check on the first search. If the first query returns fewer than 1,800 characters of results — which happens when a query is too niche, misspelled, or uses terminology that doesn't match the web's vocabulary — a fallback query is generated automatically:

def simplify_query(query: str) -> str:
    # Crude simplification: keep only the first four words of the query.
    # e.g. "saturation-gated novelty-assessed research loop LLM"
    #   → "saturation-gated novelty-assessed research loop"
    # A real jargon filter would aim for something closer to "LLM research loop".
    words = query.split()
    return " ".join(words[:4])

The floor prevents synthesis from running on near-empty context — a situation that reliably produces placeholder content and low Wiggum scores.
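To see the floor fire without touching the network, the check can be exercised against a stubbed search function. This is a sketch under stated assumptions: first_search_with_floor and fake_search are hypothetical names, result size is measured on body text only, and the guard that skips the retry when the simplified query equals the original is an addition not present in the harness listing.

```python
SEARCH_QUALITY_FLOOR = 1800  # minimum characters of results to proceed


def simplify_query(query: str) -> str:
    # Same crude rule as the reading: keep the first four words
    return " ".join(query.split()[:4])


def first_search_with_floor(query, search_fn, floor=SEARCH_QUALITY_FLOOR):
    """Return (results, query_used); retry once with a simplified query if thin."""
    results = search_fn(query)
    size = sum(len(r.get("body", "")) for r in results)
    if size < floor:
        fallback = simplify_query(query)
        # Retrying an identical query would return the same thin results
        if fallback != query:
            return search_fn(fallback), fallback
    return results, query


def fake_search(query):
    # Hypothetical stand-in for the real search: long queries come back thin
    if len(query.split()) > 4:
        return [{"body": "x" * 200}]
    return [{"body": "x" * 2500}]


results, used = first_search_with_floor(
    "saturation gated novelty assessed research loop", fake_search
)
print(used)  # saturation gated novelty assessed
```

The six-word query returns only 200 characters, so the fallback runs with the first four words and clears the floor.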

The SQLite Search Cache

All DDGS (DuckDuckGo Search) calls are wrapped in a 24-hour TTL cache:

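import hashlib
import json
import sqlite3
import time

CACHE_TTL_SECONDS = 86400  # 24 hours

def web_search_cached(query: str) -> list[dict]:
    # Hash the query so the cache key has a fixed length
    key = hashlib.sha256(query.encode()).hexdigest()
    conn = sqlite3.connect('search_cache.db')
    try:
        # Create the table on first use (assumed schema; column order matches the INSERT)
        conn.execute(
            'CREATE TABLE IF NOT EXISTS search_cache '
            '(key TEXT PRIMARY KEY, results TEXT, timestamp REAL)'
        )

        # Check cache
        row = conn.execute(
            'SELECT results, timestamp FROM search_cache WHERE key=?', (key,)
        ).fetchone()

        if row and (time.time() - row[1]) < CACHE_TTL_SECONDS:
            return json.loads(row[0])

        # Cache miss or expired entry: fetch and store
        results = ddgs_search(query)
        conn.execute(
            'INSERT OR REPLACE INTO search_cache VALUES (?,?,?)',
            (key, json.dumps(results), time.time())
        )
        conn.commit()
        return results
    finally:
        conn.close()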

The cache serves two purposes:

  1. Speed: Repeated runs on the same task within 24 hours skip the network round-trip. Autoresearch sessions run many variants of similar tasks — without caching, this would mean hundreds of identical network calls.

  2. Reproducibility: Within a single experiment, all runs with the same query get the same search results. This is essential for controlled comparisons — if search results vary between runs, you can't isolate the effect of the harness change you're testing.

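The tradeoff is staleness: cached results can be up to 24 hours old. For time-sensitive topics this matters; for knowledge synthesis over established techniques, it rarely does. The reproducibility guarantee also holds only inside the TTL window: an experiment that spans more than 24 hours can see an entry expire and be refetched with different results.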
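The hit, miss, and expiry behavior can be checked in isolation with an in-memory database and a controllable clock. The sketch below is illustrative only: make_cache, fake_fetch, and the clock dict are hypothetical names, and it keys on the raw query string instead of a SHA-256 digest to keep the example short.

```python
import json
import sqlite3

TTL_SECONDS = 86400  # 24 hours


def make_cache(conn, fetch, now):
    """Wrap fetch in a TTL cache; now is injected so tests can move time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_cache "
        "(key TEXT PRIMARY KEY, results TEXT, timestamp REAL)"
    )

    def cached(query):
        row = conn.execute(
            "SELECT results, timestamp FROM search_cache WHERE key=?", (query,)
        ).fetchone()
        if row and now() - row[1] < TTL_SECONDS:
            return json.loads(row[0])  # fresh entry: no fetch
        results = fetch(query)
        conn.execute(
            "INSERT OR REPLACE INTO search_cache VALUES (?,?,?)",
            (query, json.dumps(results), now()),
        )
        conn.commit()
        return results

    return cached


calls = []
clock = {"t": 0.0}


def fake_fetch(query):
    # Hypothetical stand-in for the network call; records each invocation
    calls.append(query)
    return [{"title": f"result #{len(calls)}"}]


cached = make_cache(sqlite3.connect(":memory:"), fake_fetch, lambda: clock["t"])
first = cached("context engineering")   # miss: fetches
second = cached("context engineering")  # hit: served from cache
clock["t"] += TTL_SECONDS + 1           # jump past the TTL
third = cached("context engineering")   # expired: fetches again
```

The first two calls return identical results from a single fetch. The third call lands after expiry and returns a different payload, which is the reproducibility break that occurs when an experiment outlives the TTL.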

Result Format

DDGS returns a list of dicts. The harness formats these for synthesis context:

def format_results(results: list[dict]) -> str:
    blocks = []
    for r in results:
        blocks.append(
            f"**{r.get('title', 'Untitled')}**\n"
            f"{r.get('href', '')}\n\n"
            f"{r.get('body', '')}"
        )
    return "\n\n---\n\n".join(blocks)

The formatted string is what gets passed to synthesis. Each result has a title (bold), a URL, and the snippet body. The producer model can cite URLs in its output.
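A short usage example makes the layout concrete. The sample results below are invented for illustration, and the function is repeated only so the snippet runs on its own; the second result omits its title to show the 'Untitled' default.

```python
def format_results(results: list[dict]) -> str:
    blocks = []
    for r in results:
        blocks.append(
            f"**{r.get('title', 'Untitled')}**\n"
            f"{r.get('href', '')}\n\n"
            f"{r.get('body', '')}"
        )
    return "\n\n---\n\n".join(blocks)


# Invented sample data using the title/href/body keys from the reading
sample = [
    {"title": "Context windows", "href": "https://example.com/a", "body": "Snippet A."},
    {"href": "https://example.com/b", "body": "Snippet B."},
]

formatted = format_results(sample)
print(formatted)
# **Context windows**
# https://example.com/a
#
# Snippet A.
#
# ---
#
# **Untitled**
# https://example.com/b
#
# Snippet B.
```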

Failure Modes of the Fixed Loop

The dual-search loop was adequate for Experiment 1 but has two systematic failure modes:

Over-search — simple, well-covered topics saturate after one search round. The second query fetches results that overlap heavily with the first: same facts, same sources, different phrasing. This inflates the synthesis context with redundant information, sometimes causing the model to structure output around the search format rather than the task structure. It also wastes time and tokens.

Under-search — complex, cross-disciplinary topics have meaningful signal beyond the second query. Two searches covering "context engineering LLM" and "RAG retrieval augmented generation" miss adjacent material on context compression, tool-calling strategies, and recent papers on context window management. The agent synthesizes from an incomplete picture.

URL enrichment, the step that fetches full page content for result URLs, has the same problem. Fetching the full page for a URL whose snippet content is already covered by the gathered results wastes 30–60 seconds per URL and dilutes the synthesis context.
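One rough way to observe over-search is to measure how many of the second query's URLs the first query already returned. The function below is a diagnostic sketch with a hypothetical name (url_overlap) and invented sample data; it is not the saturation gate covered in the next reading, and URL overlap is only a proxy for redundant content.

```python
def url_overlap(first: list[dict], second: list[dict]) -> float:
    """Fraction of the second result set's URLs already present in the first."""
    seen = {r.get("href") for r in first if r.get("href")}
    urls = [r.get("href") for r in second if r.get("href")]
    if not urls:
        return 0.0  # nothing to compare
    return sum(u in seen for u in urls) / len(urls)


# Invented sample data: two of the three second-round URLs repeat the first round
round_one = [{"href": "https://example.com/a"},
             {"href": "https://example.com/b"},
             {"href": "https://example.com/c"}]
round_two = [{"href": "https://example.com/b"},
             {"href": "https://example.com/c"},
             {"href": "https://example.com/d"}]

overlap = url_overlap(round_one, round_two)
```

A value near 1.0 suggests the second search added little. A value near 0.0 on a broad topic hints that further queries might still surface new material, which is the under-search case.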

These failure modes motivated the saturation-gating approach described in the next reading.