Harness Engineering for AI Agents · Context Engineering & Memory

Large Document Context


By the end of this reading you will be able to:

Describe the two chunker extraction strategies and explain which conditions trigger each
Interpret a provenance metadata tag and explain what each field communicates to the synthesis model
Explain how MarkItDown enables rich document conversion and what file types it handles

The Problem with Long Documents

When a task references a file — a PDF, a local Markdown document, a research paper — the naïve approach is to pass the entire file into the synthesis context. This fails in two ways:

Context window overflow. A 200-page PDF at ~500 tokens per page is roughly 100,000 tokens, far more than most models can use effectively even when the nominal window is larger. The synthesis call either fails outright or produces degenerate output.
Irrelevance. Even when a document fits, say at 20,000 tokens, the synthesis model must attend over a large amount of text to find the few relevant sections. Signal-to-noise degrades and output quality drops.

The chunker addresses both problems by extracting only the relevant portions of a document, with provenance metadata so the model can cite specific passages.

Activation Threshold

CHUNK_THRESHOLD = 12_000  # characters (~3,000 tokens at ~4 chars/token)

def read_file_context(file_path: str, task: str) -> str:
    content = read_file(file_path)
    if len(content) < CHUNK_THRESHOLD:
        return content  # small enough to pass directly
    return extract_paper_context(content, task, file_path)

Files under 12,000 characters (roughly 3,000 tokens) are passed verbatim. Files at or above the threshold go through the chunker.

Strategy 1: Section Extraction

For structured documents with 3 or more Markdown headings, the chunker uses section extraction:

SECTION_PRIORITY = [
    "Abstract", "Summary", "Conclusion", "Results",
    "Introduction", "Discussion", "Methods", "Background"
]

def section_extract(content, char_budget=8000):
    sections = parse_markdown_sections(content)  # {heading: text}, in document order
    selected = []
    selected_headings = set()  # guards against selecting the same section twice
    used_chars = 0

    # Priority order: high-value sections first
    for target in SECTION_PRIORITY:
        for heading, text in sections.items():
            if heading in selected_headings:
                continue  # e.g. "Results and Discussion" matches two targets
            if target.lower() in heading.lower():
                if used_chars + len(text) <= char_budget:
                    selected.append((heading, text))
                    selected_headings.add(heading)
                    used_chars += len(text)

    # Fill remaining budget with other sections, in document order
    for heading, text in sections.items():
        if heading not in selected_headings:
            if used_chars + len(text) <= char_budget:
                selected.append((heading, text))
                selected_headings.add(heading)
                used_chars += len(text)

    return format_with_provenance(selected)

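Abstract, Summary, and Conclusion come first because they compress the most information per character. Results and Introduction follow, then Discussion, Methods, and Background. Any remaining budget is filled by other sections in document order. A section that would push the total past char_budget is skipped whole rather than truncated.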
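The code above relies on parse_markdown_sections(), which this reading does not show. The following is a minimal sketch of what such a helper might look like, assuming ATX-style headings (lines starting with one to six # characters); the regex and the merge behavior for repeated headings are illustrative assumptions, not the course's actual implementation.

```python
import re

# Assumption: headings are ATX-style lines such as "# Title" or "### 2.1 Methods".
# A "#" comment line inside a code block would also match; a real parser
# would need to skip fenced code.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def parse_markdown_sections(content: str) -> dict[str, str]:
    """Split Markdown into {heading: body text}, preserving document order."""
    sections: dict[str, str] = {}
    matches = list(HEADING_RE.finditer(content))
    for idx, match in enumerate(matches):
        heading = match.group(1)
        body_start = match.end()
        # A section's body runs until the next heading or the end of the file.
        body_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(content)
        body = content[body_start:body_end].strip()
        if heading in sections:
            # Hypothetical choice: merge repeated headings instead of overwriting.
            sections[heading] += "\n\n" + body
        else:
            sections[heading] = body
    return sections


doc = "# Abstract\nShort summary.\n\n# Methods\nWe did things.\n\n# Conclusion\nIt worked.\n"
sections = parse_markdown_sections(doc)
print(list(sections))
```

Because Python dictionaries preserve insertion order, iterating over the result visits sections in document order, which is what the fill step in section_extract() depends on.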

Strategy 2: Semantic Chunk Retrieval

For unstructured documents (those with fewer than three Markdown headings), the chunker uses an ephemeral ChromaDB vector store:

import uuid

import chromadb

def semantic_chunk_retrieve(content, task, char_budget=8000):
    # Create overlapping windows
    window_size = 500   # characters
    overlap = 100
    chunks = []
    for i in range(0, len(content), window_size - overlap):
        chunk_text = content[i:i + window_size]
        chunks.append({
            "text": chunk_text,
            "char_offset": i,
            "paragraph": content[:i].count("\n\n")
        })

    # Embed all chunks (ChromaDB's default embedding model is all-MiniLM-L6-v2)
    ephemeral_db = chromadb.EphemeralClient()
    collection = ephemeral_db.create_collection(
        # Unique name: in-memory clients can share state within one process
        name=f"chunks-{uuid.uuid4().hex}",
        # ChromaDB defaults to L2 distance; request cosine explicitly
        metadata={"hnsw:space": "cosine"},
    )
    collection.add(
        documents=[c["text"] for c in chunks],
        ids=[str(i) for i in range(len(chunks))],
        # Store only positional fields; the text is already in `documents`
        metadatas=[
            {"char_offset": c["char_offset"], "paragraph": c["paragraph"]}
            for c in chunks
        ],
    )

    # Retrieve top-K by cosine similarity to task
    results = collection.query(query_texts=[task], n_results=min(10, len(chunks)))
    ephemeral_db.delete_collection(collection.name)  # free the in-memory index
    top_chunks = sorted(
        [chunks[int(chunk_id)] for chunk_id in results["ids"][0]],
        key=lambda c: c["char_offset"]  # re-sort to reading order
    )

    # Assemble within budget
    selected_text = ""
    for chunk in top_chunks:
        if len(selected_text) + len(chunk["text"]) <= char_budget:
            selected_text += chunk["text"] + "\n\n"
    return selected_text

The re-sort to reading order after retrieval is important: the model reads the selected passages as a coherent sequence, not as a collection of random snippets.
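The windowing and re-sort steps can be tried in isolation, without a vector store. The sketch below is illustrative only: make_windows() and to_reading_order() are hypothetical helper names, and the "ranking" is hard-coded to stand in for what a similarity query would return.

```python
def make_windows(content: str, window_size: int = 500, overlap: int = 100) -> list[dict]:
    """Slice text into overlapping character windows, recording each start offset."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    stride = window_size - overlap  # each window starts 400 chars after the previous one
    return [
        {"text": content[i:i + window_size], "char_offset": i}
        for i in range(0, len(content), stride)
    ]


def to_reading_order(ranked_chunks: list[dict]) -> list[dict]:
    """Re-sort relevance-ranked chunks by their position in the document."""
    return sorted(ranked_chunks, key=lambda c: c["char_offset"])


# A 1,200-character document made of a repeating alphabet.
text = "".join(chr(97 + i % 26) for i in range(1200))
windows = make_windows(text)
offsets = [w["char_offset"] for w in windows]

# Stand-in for a similarity ranking: pretend the last window scored highest.
ranked = [windows[2], windows[0], windows[1]]
ordered = to_reading_order(ranked)
print(offsets, [c["char_offset"] for c in ordered])
```

Each window shares its last 100 characters with the start of the next one, so a sentence that straddles a window boundary still appears whole in at least one chunk. The final window may be shorter than window_size.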

Provenance Metadata

Every extracted section or chunk is tagged with provenance metadata so the synthesis model can cite specific passages:

=== Introduction [source:paper.pdf | p.3 | ¶12 | §Introduction | @4,200] ===
Attention is the core mechanism of the transformer...

Tag fields:

source — file name
p.N — estimated page number (using a page_size character estimate)
¶N — paragraph number (count of \n\n before the chunk start)
§Section — heading (section extraction only)
@N — character offset in the original document

The synthesis model can incorporate these tags directly into citations: "As noted in the Introduction of paper.pdf (p.3)..."
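To make the fields concrete, here is a minimal sketch of how such a tag could be assembled. It is not the course's format_with_provenance(); the function name provenance_tag(), the PAGE_SIZE value of 2,000 characters, and the "Excerpt" label for chunks without a heading are all assumptions made for illustration. With that assumed page size, offset 4,200 lands on page 3, which is consistent with the example tag above.

```python
from typing import Optional

PAGE_SIZE = 2000  # assumed characters per page (hypothetical value)


def provenance_tag(
    content: str,
    offset: int,
    source: str,
    section: Optional[str] = None,
    page_size: int = PAGE_SIZE,
) -> str:
    """Build a provenance header for the passage starting at `offset`."""
    page = offset // page_size + 1                # estimated, 1-based page number
    paragraph = content[:offset].count("\n\n")   # blank-line breaks before the passage
    fields = [f"source:{source}", f"p.{page}", f"¶{paragraph}"]
    if section:
        fields.append(f"§{section}")              # only present for section extraction
    fields.append(f"@{offset:,}")                 # character offset with thousands separator
    label = section or "Excerpt"                  # "Excerpt" is a hypothetical fallback label
    return f"=== {label} [{' | '.join(fields)}] ==="


# Thirteen 348-character paragraphs: the 13th starts at offset 12 * 350 = 4200.
content = "\n\n".join(["x" * 348] * 13)
tag = provenance_tag(content, 4200, "paper.pdf", section="Introduction")
print(tag)
```

Because the page number is derived from a character count rather than from the PDF's real pagination, it should be read as an approximation.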

MarkItDown Integration

Before chunking, documents in rich formats go through MarkItDown, a Microsoft library that converts them to Markdown:

RICH_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".epub", ".htm", ".html"}

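from pathlib import Path

def read_file(file_path):
    ext = Path(file_path).suffix.lower()
    if ext in RICH_EXTENSIONS:
        try:
            from markitdown import MarkItDown
            md = MarkItDown()
            result = md.convert(file_path)
            return result.text_content
        except ImportError:
            pass  # markitdown not installed: fall through to a plain-text read
    # errors="replace" keeps the fallback from raising on binary formats such
    # as PDF, though the text returned for those files will be mostly unreadable
    with open(file_path, encoding="utf-8", errors="replace") as f:
        return f.read()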

MarkItDown handles:

PDF — text extraction (not OCR; images are not extracted)
DOCX — paragraphs, tables, headings
XLSX — sheet data as Markdown tables
PPTX — slide text and notes
EPUB — chapter content
HTML/HTM — body text with basic structure

URL enrichment also uses MarkItDown: when a URL is fetched for full-page content, MarkItDown converts the HTML to Markdown before it enters the synthesis context.

The Complete read_file_context() Flow

read_file_context(file_path, task)
  → read_file()                   # MarkItDown conversion if needed
    → len(content) < 12,000?      # threshold check
      → YES: return content verbatim
      → NO: extract_paper_context()
          → count markdown headings
            → ≥ 3 headings: section_extract()      # structured document
            → < 3 headings: semantic_chunk_retrieve() # unstructured document
          → attach provenance tags
          → return context string
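The branch on heading count is the only decision in the flow that this reading does not show as code. A minimal sketch of that check follows, assuming ATX-style headings; count_markdown_headings() and choose_strategy() are hypothetical names, and the real extract_paper_context() may count headings differently.

```python
import re

MIN_HEADINGS = 3  # threshold stated in the flow above

# Assumption: a heading is a line starting with 1-6 "#" characters,
# followed by whitespace and at least one non-space character.
HEADING_LINE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def count_markdown_headings(content: str) -> int:
    """Count ATX-style Markdown heading lines in the text."""
    return len(HEADING_LINE.findall(content))


def choose_strategy(content: str) -> str:
    """Return the name of the extraction strategy the flow would pick."""
    if count_markdown_headings(content) >= MIN_HEADINGS:
        return "section_extract"          # structured document
    return "semantic_chunk_retrieve"      # unstructured document


structured = "# Abstract\ntext\n## Methods\ntext\n## Results\ntext\n"
unstructured = "Just prose.\n\n# One heading\nMore prose with a #hashtag."
print(choose_strategy(structured), choose_strategy(unstructured))
```

Requiring whitespace after the # characters keeps hashtags and similar inline text from being counted as headings.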

References

MarkItDown: Microsoft document-to-Markdown converter
