Large Document Context
- Describe the two chunker extraction strategies and explain which conditions trigger each
- Interpret a provenance metadata tag and explain what each field communicates to the synthesis model
- Explain how MarkItDown enables rich document conversion and what file types it handles
The Problem with Long Documents
When a task references a file — a PDF, a local Markdown document, a research paper — the naïve approach is to pass the entire file into the synthesis context. This fails in two ways:
Context window overflow. A 200-page PDF at ~500 tokens per page is 100,000 tokens — far beyond most models' effective context windows. The synthesis call fails or produces degenerate output.
Irrelevance. Even when a document fits in the context window, say at 20,000 tokens, the synthesis model must attend over a large amount of text to find the few relevant sections. Signal-to-noise degrades, and output quality drops with it.
The chunker addresses both problems by extracting only the relevant portions of a document, with provenance metadata so the model can cite specific passages.
Activation Threshold
CHUNK_THRESHOLD = 12_000  # characters (~3,000 tokens at ~4 chars/token)

def read_file_context(file_path: str, task: str) -> str:
    content = read_file(file_path)
    if len(content) < CHUNK_THRESHOLD:
        return content  # small enough to pass directly
    return extract_paper_context(content, task, file_path)
Files shorter than 12,000 characters are passed verbatim. Files at or above the threshold go through the chunker.
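The character threshold is a proxy for token count. A rough sketch of that conversion follows, assuming the ~4 characters-per-token heuristic from the code comment. The helpers `estimate_tokens` and `needs_chunking` are hypothetical names used for illustration, not part of the original code.

```python
CHUNK_THRESHOLD = 12_000  # characters, as in the source
CHARS_PER_TOKEN = 4       # rough heuristic; real tokenizers vary by model and language


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return len(text) // CHARS_PER_TOKEN


def needs_chunking(text: str) -> bool:
    """Mirror the threshold check: chunk only at or above CHUNK_THRESHOLD."""
    return len(text) >= CHUNK_THRESHOLD


short_doc = "x" * 11_999
long_doc = "x" * 12_000
print(estimate_tokens(long_doc), needs_chunking(short_doc), needs_chunking(long_doc))
```

Because the check is on characters rather than tokens, it costs nothing to evaluate, at the price of being imprecise for text that tokenizes unusually densely or sparsely.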
Strategy 1: Section Extraction
For structured documents with 3 or more Markdown headings, the chunker uses section extraction:
SECTION_PRIORITY = [
    "Abstract", "Summary", "Conclusion", "Results",
    "Introduction", "Discussion", "Methods", "Background",
]

def section_extract(content, char_budget=8000):
    sections = parse_markdown_sections(content)  # {heading: text}
    selected = []
    chosen = set()  # headings already selected, so none is added twice
    used_chars = 0

    # Priority order: high-value sections first
    for target in SECTION_PRIORITY:
        for heading, text in sections.items():
            if heading in chosen or target.lower() not in heading.lower():
                continue
            if used_chars + len(text) <= char_budget:
                selected.append((heading, text))
                chosen.add(heading)
                used_chars += len(text)

    # Fill remaining budget with other sections, in document order
    for heading, text in sections.items():
        if heading not in chosen and used_chars + len(text) <= char_budget:
            selected.append((heading, text))
            chosen.add(heading)
            used_chars += len(text)

    return format_with_provenance(selected)
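Abstract, Summary, and Conclusion come first because they compress the most information per character. Results and Introduction follow, then Discussion, Methods, and Background. Any remaining budget is filled by the other sections in document order. A section that would exceed the budget is skipped rather than truncated, so a later, shorter section can still be included.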
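The `parse_markdown_sections` helper is referenced but not shown. A minimal sketch of what it might look like, assuming ATX-style headings (`#` through `######`) and a `{heading: text}` return value in document order. The regex and the choice to keep the first of any duplicate headings are assumptions for illustration.

```python
import re

# ATX headings only: one to six '#' characters, a space or tab, then the title.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def parse_markdown_sections(content: str) -> dict[str, str]:
    """Split Markdown into {heading: body}, preserving document order."""
    matches = list(HEADING_RE.finditer(content))
    sections: dict[str, str] = {}
    for idx, match in enumerate(matches):
        start = match.end()
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(content)
        # Duplicate headings would overwrite each other; keep the first one.
        sections.setdefault(match.group(2), content[start:end].strip())
    return sections


doc = "# Title\nIntro text.\n\n## Abstract\nShort summary.\n\n## Methods\nDetails.\n"
sections = parse_markdown_sections(doc)
print(list(sections))
```

This simple pattern would also match `#` comment lines inside code samples, so a production version would likely need to skip fenced regions.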
Strategy 2: Semantic Chunk Retrieval
For unstructured documents (fewer than three Markdown headings, typically plain prose), the chunker uses an ephemeral ChromaDB vector store:
import uuid

import chromadb

def semantic_chunk_retrieve(content, task, char_budget=8000):
    # Create overlapping windows
    window_size = 500  # characters
    overlap = 100
    chunks = []
    for i in range(0, len(content), window_size - overlap):
        chunks.append({
            "text": content[i:i + window_size],
            "char_offset": i,
            "paragraph": content[:i].count("\n\n"),
        })

    # Embed all chunks with all-MiniLM-L6-v2 (Chroma's default embedding function)
    ephemeral_db = chromadb.EphemeralClient()
    # Unique name: in-process ephemeral clients can share state across calls
    collection = ephemeral_db.create_collection(
        name=f"chunks-{uuid.uuid4().hex}",
        metadata={"hnsw:space": "cosine"},  # Chroma defaults to L2 distance
    )
    collection.add(
        documents=[c["text"] for c in chunks],
        ids=[str(i) for i in range(len(chunks))],
        metadatas=[
            {"char_offset": c["char_offset"], "paragraph": c["paragraph"]}
            for c in chunks
        ],
    )

    # Retrieve top-K by cosine similarity to the task
    results = collection.query(query_texts=[task], n_results=min(10, len(chunks)))
    ephemeral_db.delete_collection(collection.name)
    top_chunks = sorted(
        (chunks[int(chunk_id)] for chunk_id in results["ids"][0]),
        key=lambda c: c["char_offset"],  # re-sort to reading order
    )

    # Assemble within budget
    selected_text = ""
    for chunk in top_chunks:
        if len(selected_text) + len(chunk["text"]) <= char_budget:
            selected_text += chunk["text"] + "\n\n"
    return selected_text
The re-sort to reading order after retrieval is important: the model reads the selected passages as a coherent sequence, not as a collection of random snippets.
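The windowing and re-sort steps can be checked in isolation, without a vector store. The sketch below uses the same 500-character window and 100-character overlap. The `make_windows` helper and the hard-coded `retrieved_ids` list are hypothetical stand-ins for the chunk loop and the query result.

```python
def make_windows(content: str, window_size: int = 500, overlap: int = 100) -> list[dict]:
    """Slice content into overlapping character windows, recording each offset."""
    stride = window_size - overlap
    return [
        {"text": content[i:i + window_size], "char_offset": i}
        for i in range(0, len(content), stride)
    ]


content = "".join(str(i % 10) for i in range(1200))
windows = make_windows(content)
offsets = [w["char_offset"] for w in windows]

# Pretend the vector store returned these chunk ids, best match first.
retrieved_ids = ["2", "0", "1"]
in_reading_order = sorted(
    (windows[int(i)] for i in retrieved_ids),
    key=lambda w: w["char_offset"],
)
print(offsets, [w["char_offset"] for w in in_reading_order])
```

Each window starts 400 characters after the previous one, so consecutive windows share 100 characters. That overlap reduces the chance that a relevant sentence is split across a boundary, though adjacent retrieved windows will repeat the shared text.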
Provenance Metadata
Every extracted section or chunk is tagged with provenance metadata so the synthesis model can cite specific passages:
=== Introduction [source:paper.pdf | p.3 | ¶12 | §Introduction | @4,200] ===
Attention is the core mechanism of the transformer...
Tag fields:
- source: file name
- p.N: estimated page number (using a page_size character estimate)
- ¶N: paragraph number (count of \n\n before the chunk start)
- §Section: heading (section extraction only)
- @N: character offset in the original document
The synthesis model can incorporate these tags directly into citations: "As noted in the Introduction of paper.pdf (p.3)..."
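To make the tag layout concrete, here is a sketch of a formatter that reproduces the example tag. The function name `format_provenance_tag`, the `PAGE_SIZE` value of 2,000 characters, the 1-based page numbering, and the "Chunk" label for untitled passages are all assumptions. The source does not state how `page_size` is set.

```python
PAGE_SIZE = 2_000  # assumed characters per page; the source does not give the value


def format_provenance_tag(source, content, char_offset, section=None, page_size=PAGE_SIZE):
    """Build a header in the '=== label [source:... | p.N | ¶N | §S | @N] ===' layout."""
    page = char_offset // page_size + 1               # assumed 1-based page estimate
    paragraph = content[:char_offset].count("\n\n")   # blank-line breaks before the chunk
    fields = [f"source:{source}", f"p.{page}", f"¶{paragraph}"]
    if section:  # only section extraction knows the heading
        fields.append(f"§{section}")
    fields.append(f"@{char_offset:,}")
    label = section or "Chunk"  # hypothetical label for semantic chunks
    return f"=== {label} [{' | '.join(fields)}] ==="


# Twelve 350-character paragraphs (348 letters plus a blank-line break) precede the section.
content = ("a" * 348 + "\n\n") * 12 + "Attention is the core mechanism of the transformer..."
tag = format_provenance_tag("paper.pdf", content, 4200, section="Introduction")
print(tag)
```

With these assumptions the output matches the example tag shown earlier: offset 4,200 falls on estimated page 3, after 12 paragraph breaks.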
MarkItDown Integration
Before chunking, documents go through MarkItDown — a Microsoft library that converts rich document formats to Markdown:
from pathlib import Path

RICH_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".epub", ".htm", ".html"}

def read_file(file_path):
    ext = Path(file_path).suffix.lower()
    if ext in RICH_EXTENSIONS:
        try:
            from markitdown import MarkItDown
            md = MarkItDown()
            result = md.convert(file_path)
            return result.text_content
        except ImportError:
            pass  # graceful fallback if markitdown is not installed
    # Plain-text read; binary formats such as PDF will not be readable this way
    with open(file_path, encoding="utf-8", errors="replace") as f:
        return f.read()
MarkItDown handles:
- PDF — text extraction (not OCR; images are not extracted)
- DOCX — paragraphs, tables, headings
- XLSX — sheet data as Markdown tables
- PPTX — slide text and notes
- EPUB — chapter content
- HTML/HTM — body text with basic structure
URL enrichment also uses MarkItDown: when a URL is fetched for full-page content, MarkItDown converts the HTML to Markdown before it enters the synthesis context.
The Complete read_file_context() Flow
read_file_context(file_path, task)
  → read_file()                         # MarkItDown conversion if needed
  → len(content) < 12,000?              # threshold check
      → YES: return content verbatim
      → NO: extract_paper_context()
          → count markdown headings
              → ≥ 3 headings: section_extract()           # structured document
              → < 3 headings: semantic_chunk_retrieve()   # unstructured document
          → attach provenance tags
          → return context string
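The heading-count branch inside `extract_paper_context()` is not shown in the source. One way it might be sketched, assuming ATX-style headings and returning the strategy name rather than calling the real functions. `count_markdown_headings` and `choose_strategy` are hypothetical helpers.

```python
import re

MIN_HEADINGS = 3  # structured-document cutoff from the flow above
HEADING_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def count_markdown_headings(content: str) -> int:
    """Count lines that start with one to six '#' characters followed by text."""
    return len(HEADING_RE.findall(content))


def choose_strategy(content: str) -> str:
    """Return the name of the extraction strategy the flow would pick."""
    if count_markdown_headings(content) >= MIN_HEADINGS:
        return "section_extract"
    return "semantic_chunk_retrieve"


structured = "# A\ntext\n## B\ntext\n## C\ntext\n"
unstructured = "plain prose\n\nmore prose with a # sign inside\n"
print(choose_strategy(structured), choose_strategy(unstructured))
```

A `#` in the middle of a line does not count as a heading, so ordinary prose that mentions the character still routes to semantic retrieval.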