Harness Engineering for AI Agents · Context Engineering & Memory

Large Document Context

By the end of this reading you will be able to:
  • Describe the two chunker extraction strategies and explain which conditions trigger each
  • Interpret a provenance metadata tag and explain what each field communicates to the synthesis model
  • Explain how MarkItDown enables rich document conversion and what file types it handles

The Problem with Long Documents

When a task references a file — a PDF, a local Markdown document, a research paper — the naïve approach is to pass the entire file into the synthesis context. This fails in two ways:

  1. Context window overflow. A 200-page PDF at ~500 tokens per page is roughly 100,000 tokens. That exceeds the context window of many models and strains the effective window of most others, so the synthesis call either fails outright or produces degenerate output.

  2. Irrelevance. Even when a document fits (say, a 20,000-token file), the synthesis model must attend over a large amount of text to find the few relevant sections. Signal-to-noise degrades and output quality drops.

The chunker addresses both problems by extracting only the relevant portions of a document, with provenance metadata so the model can cite specific passages.

Activation Threshold

CHUNK_THRESHOLD = 12_000  # characters (~3,000 tokens at ~4 chars/token)

def read_file_context(file_path: str, task: str) -> str:
    content = read_file(file_path)
    if len(content) < CHUNK_THRESHOLD:
        return content  # small enough to pass directly
    return extract_paper_context(content, task, file_path)

Files under 12,000 characters are passed verbatim. Files at or above the threshold go through the chunker.
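To make the threshold arithmetic concrete, here is a minimal sketch that estimates token counts from character counts using the ~4 characters-per-token rule of thumb from the CHUNK_THRESHOLD comment. The helper names `estimate_tokens` and `needs_chunking` are hypothetical, and the ratio is only an approximation that varies by tokenizer and language.

```python
CHUNK_THRESHOLD = 12_000  # characters, as in read_file_context()
CHARS_PER_TOKEN = 4       # rule-of-thumb assumption, not a tokenizer measurement


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Rough token estimate (hypothetical helper): characters / assumed ratio."""
    return len(text) // chars_per_token


def needs_chunking(text: str, threshold: int = CHUNK_THRESHOLD) -> bool:
    """Mirror the threshold check: content below the threshold is passed verbatim."""
    return len(text) >= threshold


short_doc = "x" * 11_999
long_doc = "x" * 12_000
print(estimate_tokens(long_doc))  # 3000
print(needs_chunking(short_doc))  # False
print(needs_chunking(long_doc))   # True
```

Because the check is on characters rather than tokens, it costs nothing to evaluate, at the price of being imprecise for text that tokenizes unusually densely or sparsely.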

Strategy 1: Section Extraction

For structured documents with 3 or more Markdown headings, the chunker uses section extraction:

SECTION_PRIORITY = [
    "Abstract", "Summary", "Conclusion", "Results",
    "Introduction", "Discussion", "Methods", "Background"
]

def section_extract(content, char_budget=8000):
    sections = parse_markdown_sections(content)  # {heading: text}, in document order
    selected = []   # (heading, text) pairs in selection order
    seen = set()    # headings already selected, so no section is added twice
    used_chars = 0

    # Priority order: high-value sections first
    for target in SECTION_PRIORITY:
        for heading, text in sections.items():
            # A heading such as "Results and Discussion" matches two targets;
            # the seen check keeps it from being selected for both.
            if heading in seen or target.lower() not in heading.lower():
                continue
            if used_chars + len(text) <= char_budget:
                selected.append((heading, text))
                seen.add(heading)
                used_chars += len(text)

    # Fill remaining budget with other sections, in document order
    for heading, text in sections.items():
        if heading not in seen and used_chars + len(text) <= char_budget:
            selected.append((heading, text))
            seen.add(heading)
            used_chars += len(text)

    return format_with_provenance(selected)

Abstract, Summary, and Conclusion come first because they compress the most information per character. Results and Introduction follow. Any remaining budget is filled by other sections in document order. A section that does not fit in the remaining budget is skipped rather than truncated.
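The listing above calls parse_markdown_sections() without showing it. The following is one possible sketch of such a helper, assuming ATX-style (`#`) headings only; the regular expression and the handling of repeated headings are assumptions, not the original implementation.

```python
import re

# ATX headings only: 1-6 '#' characters, a space or tab, then the heading text.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def parse_markdown_sections(content: str) -> dict[str, str]:
    """Split Markdown into {heading: body}, preserving document order."""
    matches = list(HEADING_RE.finditer(content))
    sections: dict[str, str] = {}
    for idx, match in enumerate(matches):
        body_start = match.end()
        # A section body runs until the next heading, or to the end of the file.
        body_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(content)
        # Repeated headings would overwrite each other in a dict; keep the first.
        sections.setdefault(match.group(1), content[body_start:body_end].strip())
    return sections


doc = "# Abstract\nShort summary.\n\n## Methods\nWe did X.\n\n## Conclusion\nIt works.\n"
sections = parse_markdown_sections(doc)
print(list(sections))  # ['Abstract', 'Methods', 'Conclusion']
```

This sketch drops any text before the first heading and would treat a `#` comment inside a code listing as a heading, so a production parser would likely need to track fenced regions.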

Strategy 2: Semantic Chunk Retrieval

For unstructured documents, meaning those with fewer than 3 Markdown headings, the chunker uses an ephemeral ChromaDB vector store:

import uuid

import chromadb

def semantic_chunk_retrieve(content, task, char_budget=8000):
    # Create overlapping windows
    window_size = 500   # characters
    overlap = 100
    chunks = []
    for i in range(0, len(content), window_size - overlap):
        chunks.append({
            "text": content[i:i + window_size],
            "char_offset": i,
            "paragraph": content[:i].count("\n\n"),
        })

    # Embed all chunks with all-MiniLM-L6-v2 (Chroma's default embedding function)
    ephemeral_db = chromadb.EphemeralClient()
    collection = ephemeral_db.create_collection(
        f"chunks-{uuid.uuid4().hex}",       # unique name avoids clashes across calls
        metadata={"hnsw:space": "cosine"},  # Chroma defaults to L2; request cosine
    )
    collection.add(
        documents=[c["text"] for c in chunks],
        ids=[str(i) for i in range(len(chunks))],
        # Store only the positional fields; the text is already in documents
        metadatas=[
            {"char_offset": c["char_offset"], "paragraph": c["paragraph"]}
            for c in chunks
        ],
    )

    # Retrieve top-K by cosine similarity to task
    results = collection.query(query_texts=[task], n_results=min(10, len(chunks)))
    ephemeral_db.delete_collection(collection.name)  # release the in-memory index
    top_chunks = sorted(
        (chunks[int(chunk_id)] for chunk_id in results["ids"][0]),
        key=lambda c: c["char_offset"],  # re-sort to reading order
    )

    # Assemble within budget
    selected_text = ""
    for chunk in top_chunks:
        if len(selected_text) + len(chunk["text"]) <= char_budget:
            selected_text += chunk["text"] + "\n\n"
    return selected_text

The re-sort to reading order after retrieval is important: the model reads the selected passages as a coherent sequence, not as a collection of random snippets.
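The windowing and re-sorting steps can be examined independently of any vector store. The sketch below uses the same 500-character window and 100-character overlap; the helper names `make_windows` and `to_reading_order` are hypothetical, and the `ranked` list is a stand-in for the relevance ranking a retrieval call would return.

```python
def make_windows(content: str, window_size: int = 500, overlap: int = 100) -> list[dict]:
    """Slice content into overlapping character windows, recording each start offset."""
    step = window_size - overlap  # each window starts 400 characters after the last
    return [
        {"text": content[i:i + window_size], "char_offset": i}
        for i in range(0, len(content), step)
    ]


def to_reading_order(chunks: list[dict], ranked_ids: list[int]) -> list[dict]:
    """Take chunk indices in relevance order and return the chunks in document order."""
    return sorted((chunks[i] for i in ranked_ids), key=lambda c: c["char_offset"])


content = "a" * 1200
chunks = make_windows(content)
print([c["char_offset"] for c in chunks])  # [0, 400, 800]

ranked = [2, 0]  # stand-in ranking: most relevant chunk first
ordered = to_reading_order(chunks, ranked)
print([c["char_offset"] for c in ordered])  # [0, 800]
```

Note that consecutive windows share 100 characters, so when two adjacent windows are both retrieved, the overlapping text appears twice in the assembled context unless it is deduplicated.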

Provenance Metadata

Every extracted section or chunk is tagged with provenance metadata so the synthesis model can cite specific passages:

=== Introduction [source:paper.pdf | p.3 | ¶12 | §Introduction | @4,200] ===
Attention is the core mechanism of the transformer...

Tag fields:

  • source — file name
  • p.N — estimated page number (using a page_size character estimate)
  • ¶N — paragraph number (count of \n\n before the chunk start)
  • §Section — heading (section extraction only)
  • @N — character offset in the original document

The synthesis model can incorporate these tags directly into citations: "As noted in the Introduction of paper.pdf (p.3)..."
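As an illustration of how such a tag could be assembled, here is a minimal sketch. The function name `format_provenance_header`, the `page_size` default of 2,000 characters (about 500 tokens per page at ~4 characters per token), and the use of the file name as the title when no section is known are all assumptions rather than the original implementation.

```python
from typing import Optional


def format_provenance_header(
    source: str,
    char_offset: int,
    paragraph: int,
    section: Optional[str] = None,
    page_size: int = 2_000,  # assumed characters per page; tune per document type
) -> str:
    """Build a provenance header line in the tag style shown above (hypothetical helper)."""
    page = char_offset // page_size + 1  # 1-based page estimate from the offset
    fields = [f"source:{source}", f"p.{page}", f"¶{paragraph}"]
    if section is not None:
        fields.append(f"§{section}")  # only section extraction knows the heading
    fields.append(f"@{char_offset:,}")  # thousands separator, as in the example tag
    title = section if section is not None else source  # assumption for untitled chunks
    return f"=== {title} [{' | '.join(fields)}] ==="


print(format_provenance_header("paper.pdf", 4200, 12, "Introduction"))
# === Introduction [source:paper.pdf | p.3 | ¶12 | §Introduction | @4,200] ===
```

The page number is only as good as the `page_size` guess, so citations built from it should be read as approximate locations rather than exact page references.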

MarkItDown Integration

Before chunking, documents go through MarkItDown — a Microsoft library that converts rich document formats to Markdown:

RICH_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".epub", ".htm", ".html"}

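from pathlib import Path

def read_file(file_path):
    ext = Path(file_path).suffix.lower()
    if ext in RICH_EXTENSIONS:
        try:
            from markitdown import MarkItDown
            md = MarkItDown()
            result = md.convert(file_path)
            return result.text_content
        except ImportError:
            pass  # markitdown not installed: fall back to a plain-text read
    # Plain-text read. errors="replace" keeps a binary file (e.g. a PDF read
    # through the fallback path) from raising UnicodeDecodeError, although
    # the resulting text will be noisy.
    with open(file_path, encoding="utf-8", errors="replace") as f:
        return f.read()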

For the extensions listed above, MarkItDown extracts:

  • PDF — text extraction (not OCR; images are not extracted)
  • DOCX — paragraphs, tables, headings
  • XLSX — sheet data as Markdown tables
  • PPTX — slide text and notes
  • EPUB — chapter content
  • HTML/HTM — body text with basic structure

URL enrichment also uses MarkItDown: when a URL is fetched for full-page content, MarkItDown converts the HTML to Markdown before it enters the synthesis context.

The Complete read_file_context() Flow

read_file_context(file_path, task)
  → read_file()                   # MarkItDown conversion if needed
    → len(content) < 12,000?      # threshold check
      → YES: return content verbatim
      → NO: extract_paper_context()
          → count markdown headings
            → ≥ 3 headings: section_extract()      # structured document
            → < 3 headings: semantic_chunk_retrieve() # unstructured document
          → attach provenance tags
          → return context string
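The heading-count branch in this flow can be sketched as follows. The source does not show the body of extract_paper_context(), so the regular expression, the `MIN_HEADINGS` constant, and the helper names `count_markdown_headings` and `choose_strategy` are assumptions made for illustration.

```python
import re

# An ATX heading line: 1-6 '#' characters, whitespace, then some text.
HEADING_LINE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)
MIN_HEADINGS = 3  # documents with at least this many headings count as structured


def count_markdown_headings(content: str) -> int:
    """Count lines that look like Markdown headings."""
    return len(HEADING_LINE.findall(content))


def choose_strategy(content: str) -> str:
    """Return which extraction strategy the flow above would select."""
    if count_markdown_headings(content) >= MIN_HEADINGS:
        return "section_extract"
    return "semantic_chunk_retrieve"


structured = "# Abstract\ntext\n\n## Methods\ntext\n\n## Results\ntext\n"
prose = "Plain prose with a #hashtag but no headings.\n\nAnother paragraph.\n"
print(choose_strategy(structured))  # section_extract
print(choose_strategy(prose))       # semantic_chunk_retrieve
```

Requiring whitespace after the `#` run keeps hashtags and similar tokens from being counted as headings, which matters for converted HTML or chat exports.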