Harness Engineering for AI Agents · Context Engineering & Memory

Memory Systems

By the end of this reading you will be able to:

Explain how ChromaDB semantic retrieval and SQLite FTS5 serve different retrieval roles and when each is more appropriate
Describe the Observation schema and explain which fields drive semantic retrieval vs. keyword retrieval
Trace how a completed run becomes a stored observation, how that observation is retrieved using quality-weighted ranking, and how it is injected in a future run on a related topic

Why Persistent Memory?

Without memory, every run starts from scratch. The agent searches the web for facts it already established two runs ago, synthesizes conclusions it has already drawn, and produces output that ignores its own prior work. This wastes tokens, wastes time, and produces inconsistent results across runs on similar topics.

The memory system gives the harness a cumulative picture of what it has learned. Each run adds to a growing store of observations. Future runs on related topics retrieve the relevant observations before planning, making the planner aware of prior coverage and the synthesis stage aware of established findings.

The Dual-Index Architecture

The memory system uses two complementary retrieval mechanisms:

ChromaDB — a vector database using all-MiniLM-L6-v2 embeddings for semantic similarity retrieval. Given a new task string, ChromaDB finds observations whose meaning is similar, even if the vocabulary differs. A task about "LLM cost management" retrieves observations about "token budget control" and "inference cost optimization" without exact keyword matches.

SQLite FTS5 — a full-text search index for keyword retrieval. FTS5 is faster than ChromaDB and handles exact matches better. When the task contains specific technical terms — model names, technique names, algorithm names — FTS5 often outperforms semantic search.
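To make the keyword side concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the local SQLite build includes FTS5, which most do. The table name notes, its columns, the helper build_match_query, and the sample rows are all hypothetical and do not reflect the harness's actual schema.

```python
import sqlite3

def build_match_query(keywords: list[str]) -> str:
    """OR the keywords together, double-quoting each so FTS5 reads it as a literal string."""
    return " OR ".join('"' + kw.replace('"', '""') + '"' for kw in keywords)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO notes (title, body) VALUES (?, ?)",
    [
        ("Token budgets", "Capping tokens per stage keeps inference cost predictable."),
        ("Planner model", "The planner runs on glm4 with a low temperature."),
        ("Vector search", "Embeddings from MiniLM capture paraphrases."),
    ],
)

# Exact technical terms are matched token-for-token (case-insensitively by default).
query = build_match_query(["glm4", "MiniLM"])
rows = conn.execute(
    "SELECT title FROM notes WHERE notes MATCH ? ORDER BY rank", (query,)
).fetchall()
titles = sorted(title for (title,) in rows)
print(titles)
```

Only the rows that literally contain one of the terms come back. A row about "inference cost" would not match a query for "budget", which is the gap semantic search is meant to cover.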

The get_context() function queries both and merges the results, deduplicating by observation ID:

def get_context(task: str, top_k: int = 3) -> str:
    # Semantic search via ChromaDB (returns distances alongside hits)
    chroma_hits = chroma_collection.query(
        query_texts=[task],
        n_results=top_k,
        include=["documents", "distances", "metadatas"]
    )
    # FTS5 keyword search. Each keyword is double-quoted so punctuation in
    # technical terms is not parsed as FTS5 query syntax.
    keywords = extract_keywords(task)
    fts_hits = []
    if keywords:  # an empty MATCH expression is an FTS5 syntax error
        fts_query = " OR ".join('"' + kw.replace('"', '""') + '"' for kw in keywords)
        fts_hits = db.execute(
            'SELECT id, narrative, facts, final_score, quality FROM observations WHERE observations MATCH ?',
            (fts_query,)
        ).fetchall()
    # Merge, deduplicate, rank by quality-weighted score
    return format_context(rank_and_merge(chroma_hits, fts_hits, top_k))
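The merge step is not shown in this reading, so the following is only a sketch of how deduplication by observation ID might work. It relies on the documented shape of a ChromaDB query result (one inner list per query text) and reuses the distance-to-similarity conversion given in the ranking section. The function names merge_candidates and top_k_ids are hypothetical, as is the neutral similarity of 0.5 assigned to keyword-only hits.

```python
from typing import Callable

def merge_candidates(
    chroma_hits: dict,
    fts_rows: list[dict],
    fts_only_sim: float = 0.5,  # assumption: neutral similarity for keyword-only hits
) -> dict[str, float]:
    """Return {observation_id: similarity}, with each ID appearing exactly once."""
    sims: dict[str, float] = {}
    # Chroma returns one inner list per query text; there is a single query here.
    for obs_id, dist in zip(chroma_hits["ids"][0], chroma_hits["distances"][0]):
        sims[obs_id] = 1.0 - dist / 2.0
    for row in fts_rows:
        # Keep the semantic similarity when both indexes returned the same observation.
        sims.setdefault(row["id"], fts_only_sim)
    return sims

def top_k_ids(sims: dict[str, float], score: Callable[[str, float], float], k: int) -> list[str]:
    """Rank the deduplicated candidates with the supplied scoring function."""
    ranked = sorted(sims, key=lambda obs_id: score(obs_id, sims[obs_id]), reverse=True)
    return ranked[:k]

# Made-up inputs: run-2 is returned by both indexes and must appear only once.
chroma_hits = {"ids": [["run-1", "run-2"]], "distances": [[0.4, 1.0]]}
fts_rows = [{"id": "run-2"}, {"id": "run-3"}]

sims = merge_candidates(chroma_hits, fts_rows)
best = top_k_ids(sims, score=lambda obs_id, sim: sim, k=2)
print(sims, best)
```

Passing the scoring function in as a parameter keeps the merge logic separate from the ranking formula, so the two can be tested independently.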

Quality-Weighted Retrieval Ranking

Returning the top-k semantically similar observations is necessary but not sufficient. An observation with a Wiggum score of 5.8, meaning the run produced shallow, placeholder-heavy output, provides a weaker planning signal than one with a score of 9.1 on the same topic. Ranking by semantic similarity alone cannot tell the two apart.

rank_and_merge() applies a quality-weighted composite score to all candidate observations before selecting the top-k:

def _rank_score(sim: float, row: dict) -> float:
    """Composite rank: 70% semantic similarity, 30% run quality, scaled by manual rating."""
    raw_score = row["final_score"]

    # Soft penalty: halve the quality term for failed/low-quality runs (score < 7.0)
    if raw_score is not None and raw_score < 7.0:
        qual = (raw_score / 10.0) * 0.5
    else:
        # A missing score falls back to a neutral 5.0
        qual = (raw_score or 5.0) / 10.0

    # Manual rating from inspect_run.py (optional); the multiplier is floored at 0.2
    q_adjust = max(0.2, 1.0 + (row["quality"] or 0) * 0.15)

    return (0.7 * sim + 0.3 * qual) * q_adjust

The formula has three components:

Semantic similarity (70%): ChromaDB returns a distance for each hit, and sim = 1.0 - (dist / 2.0) converts it to a similarity. If the embeddings are unit-normalized and the collection uses Chroma's default squared-L2 metric, then dist = 2 - 2 cos θ and this expression is exactly the cosine similarity, which typically falls between 0 and 1 for related text. This is the primary signal: observations close in meaning to the current task rank first.

Quality score (30%): the observation's Wiggum final score, normalized to 0–1. Observations with final_score < 7.0 have their quality contribution halved: a score of 6.0 contributes 0.6 × 0.5 = 0.30 rather than 0.60. An observation with no recorded score falls back to a neutral 0.5. This keeps failed or low-quality runs from anchoring synthesis when better observations exist.

Quality adjustment: an optional manual rating (set via inspect_run.py --rate) that can boost or penalize specific observations. Each rating point shifts the multiplier by 0.15: a rating of +1 multiplies the composite score by 1.15, a rating of -1 multiplies it by 0.85, and the multiplier never drops below 0.2. This is a lightweight feedback mechanism for runs the heuristic score didn't fully capture.
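As a worked example of the three components, the sketch below restates the scoring function in self-contained form and compares two observations at the same distance from the query. The distance of 0.4, the two scores, and the manual rating are made-up inputs chosen for illustration.

```python
from typing import Optional

def rank_score(sim: float, final_score: Optional[float], quality: Optional[int] = None) -> float:
    """Self-contained restatement of the composite score described above."""
    if final_score is not None and final_score < 7.0:
        qual = (final_score / 10.0) * 0.5    # low-scoring run: quality term halved
    else:
        qual = (final_score or 5.0) / 10.0   # missing score falls back to a neutral 0.5
    q_adjust = max(0.2, 1.0 + (quality or 0) * 0.15)
    return (0.7 * sim + 0.3 * qual) * q_adjust

dist = 0.4               # hypothetical distance returned by ChromaDB
sim = 1.0 - dist / 2.0   # 0.8

strong = rank_score(sim, final_score=9.1)               # 0.7*0.8 + 0.3*0.91 = 0.833
weak = rank_score(sim, final_score=5.8)                 # 0.7*0.8 + 0.3*0.29 = 0.647
boosted = rank_score(sim, final_score=9.1, quality=1)   # 0.833 * 1.15, about 0.958

print(round(strong, 3), round(weak, 3), round(boosted, 3))
```

With identical similarity, the 9.1 run outranks the 5.8 run by a wide margin, and a manual rating of +1 widens the gap further.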

The result: when several observations are comparably close in meaning to a query about "context engineering", the higher-quality one ranks first rather than simply the semantically nearest. Low-scoring runs on adjacent topics are deprioritized without being excluded entirely.

The Observation Schema

Each observation stored in memory represents a single completed run:

from dataclasses import dataclass

@dataclass
class Observation:
    id: str            # unique run identifier
    title: str         # one-line summary of the run's topic
    narrative: str     # 2-3 sentence description of what was covered and how
    facts: list[str]   # 5-10 bullet points of key findings
    task: str          # original task string
    task_type: str     # "enumerated" | "best_practices" | "research"
    final_score: float # Wiggum score — quality signal for retrieval ranking
    timestamp: str

The narrative field is what gets embedded in ChromaDB — it describes the run in natural language, enabling semantic retrieval. The facts list is what gets injected into the planning prompt — concrete findings rather than a paragraph description.

compress_and_store()

After a successful run, the planner model (glm4:9b) compresses the run output into an observation:

import json
from datetime import datetime

import ollama

COMPRESS_PROMPT = """\
You have just completed the following task:
Task: {task}

Output:
{output}

Produce a structured observation for storage in memory:
1. A one-line title describing the topic
2. A 2-3 sentence narrative of what was covered and what was concluded
3. 5-10 key facts as bullet points

Output as JSON: {{"title": ..., "narrative": ..., "facts": [...]}}"""

MAX_OUTPUT_CHARS = 2000  # only the first part of the run output is sent for compression

def compress_and_store(task, output, task_type, final_score, model):
    # str.format() cannot slice inside a placeholder, so truncate before formatting
    prompt = COMPRESS_PROMPT.format(task=task, output=output[:MAX_OUTPUT_CHARS])
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}],
                           options={"temperature": 0.1})
    obs_dict = json.loads(response["message"]["content"])

    obs = Observation(
        id=generate_run_id(),
        title=obs_dict["title"],
        narrative=obs_dict["narrative"],
        facts=obs_dict["facts"],
        task=task,
        task_type=task_type,
        final_score=final_score,
        timestamp=datetime.now().isoformat()
    )

    # Store in SQLite (for FTS5 and structured queries)
    store_sqlite(obs)
    # Embed narrative in ChromaDB (for semantic retrieval)
    chroma_collection.add(
        documents=[obs.narrative],
        ids=[obs.id],
        metadatas=[{"title": obs.title, "score": obs.final_score}]
    )
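A model's reply is not guaranteed to be valid JSON with the expected keys, so it may be worth validating the response before building the Observation. The sketch below shows one way to do that. The helper name parse_observation_json and its cleanup rules are hypothetical and are not part of the harness.

```python
import json

REQUIRED_KEYS = ("title", "narrative", "facts")
FENCE = "`" * 3  # Markdown code-fence marker that some models wrap around JSON

def parse_observation_json(raw: str) -> dict:
    """Parse and validate the compression response; raise ValueError on a bad shape."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding backticks and an optional "json" language tag.
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["facts"], list):
        raise ValueError("'facts' must be a list")
    return {
        "title": str(data["title"]).strip(),
        "narrative": str(data["narrative"]).strip(),
        # Discard empty bullet points so they never reach the planning prompt.
        "facts": [str(fact).strip() for fact in data["facts"] if str(fact).strip()],
    }

example = parse_observation_json(
    '{"title": " Context Engineering ", "narrative": "Covered RAG.", "facts": ["a", " ", "b "]}'
)
print(example)
```

Raising ValueError lets the caller decide whether to retry the compression or skip storing the run, rather than writing a malformed observation into both indexes.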

What Gets Injected

When a new task retrieves relevant observations, the injected context looks like this:

Prior work on related topics:

[Context Engineering — score 8.7]
Covered the top 5 context engineering techniques for production LLM agents including
RAG, context compression, and chain-of-thought prompting.
Key facts:
• RAG outperforms static context injection for dynamic knowledge tasks
• Context compression via summarization reduces token cost by 60-80%
• Few-shot examples should be selected by semantic similarity to the current query
• ...

[Cost Management for AI Agents — score 8.2]
...

The planner sees this context and explicitly notes in Plan.prior_work what is already covered, so the research stage doesn't re-fetch it.
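The reading names format_context() but does not show its body, so the following is a hypothetical sketch of how the injected block above could be rendered. It assumes each retrieved observation is a dict with title, final_score, narrative, and facts keys. The max_facts parameter is an invented illustration of capping prompt size.

```python
def format_context(observations: list[dict], max_facts: int = 5) -> str:
    """Render retrieved observations as a plain-text block for the planning prompt."""
    if not observations:
        return ""  # no prior work: the planner receives no memory context
    lines = ["Prior work on related topics:", ""]
    for obs in observations:
        # \u2014 is the dash used in the header format shown above.
        lines.append(f"[{obs['title']} \u2014 score {obs['final_score']:.1f}]")
        lines.append(obs["narrative"])
        lines.append("Key facts:")
        lines.extend(f"\u2022 {fact}" for fact in obs["facts"][:max_facts])
        lines.append("")  # blank line between observations
    return "\n".join(lines).rstrip()

context = format_context([
    {
        "title": "Context Engineering",
        "final_score": 8.7,
        "narrative": "Covered the top 5 context engineering techniques.",
        "facts": ["RAG outperforms static context injection for dynamic knowledge tasks"],
    }
])
print(context)
```

Returning an empty string when nothing is retrieved lets the caller omit the memory section from the prompt entirely.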

Inspecting Memory

python memory.py                           # list recent observations
python memory.py --search "context window" # test retrieval for a query

The --search flag is useful for debugging retrieval before a run: does the memory store have relevant observations for the planned task? If not, the planner will get no memory context, and the research stage starts from scratch — which is the correct behavior for genuinely novel topics.
