Harness Engineering for AI Agents · Self-Improvement

The Literature Review Pipeline

By the end of this reading you will be able to:
  • Trace the literature review pipeline from arXiv fetch through Semantic Scholar enrichment, 5-persona curation, annotation, and Jinja2 rendering
  • Explain how Semantic Scholar hub scores identify high-influence papers and how gap candidates are identified from citation graph analysis
  • Distinguish the survey template from the gaps template and explain when each is appropriate

The Pipeline

The literature review pipeline is a /lit-review skill that automates the full lifecycle of a systematic literature review — from paper discovery through final rendered document. It produces either a comprehensive survey or a gap analysis.

arxiv_fetch.py
  → semantic_scholar.py   (citation enrichment)
    → curator.py          (5-persona quality filter)
      → annotate_abstracts.py  (Nanda 8-move annotation + Wiggum)
        → synthesize()    (cluster → synthesize across papers)
          → Jinja2 render (survey or gaps template)

Stage 1: arXiv Fetch

# Fetch papers matching a query:
python arxiv_fetch.py "agentic LLM harness engineering" --max 300
# → arxiv_agentic_llm_harness.csv (300 papers)

# With date filter (for incremental updates):
python arxiv_fetch.py "prompt injection" --after 2024-06-01 --append existing.csv

# Inspect existing dataset:
python arxiv_fetch.py --stats arxiv_agentic_papers.csv

The CSV schema: arxiv_id, title, authors, published, abstract, url. Deduplication by arxiv_id prevents adding the same paper twice across incremental fetches.
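
The deduplication step can be sketched as follows. This is a minimal illustration, not the actual arxiv_fetch.py code: merge_rows and normalize_id are hypothetical helpers, and stripping the version suffix (so that 2401.00001v2 matches 2401.00001) is an assumption about how IDs are compared.

```python
import re

def normalize_id(arxiv_id):
    """Drop a trailing version suffix such as 'v2' (assumed convention)."""
    return re.sub(r"v\d+$", "", arxiv_id.strip())

def merge_rows(existing, fetched):
    """Return existing rows plus any fetched rows whose arxiv_id is new."""
    seen = {normalize_id(row["arxiv_id"]) for row in existing}
    merged = list(existing)
    for row in fetched:
        key = normalize_id(row["arxiv_id"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged

# Illustrative rows only; real rows would carry the full CSV schema.
existing = [{"arxiv_id": "2401.00001", "title": "Paper A"}]
fetched = [
    {"arxiv_id": "2401.00001v2", "title": "Paper A"},  # already present
    {"arxiv_id": "2406.00002v1", "title": "Paper B"},  # new
]
merged = merge_rows(existing, fetched)
print([row["title"] for row in merged])
```

Keeping a set of seen IDs makes each membership check constant-time, which matters when an incremental fetch is appended to a CSV that already holds hundreds of rows.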

Stage 2: Semantic Scholar Enrichment

arXiv metadata does not include citation data. semantic_scholar.py enriches each paper with data from the Semantic Scholar (S2) API:

python semantic_scholar.py arxiv_agentic_papers.csv

Added fields per paper:

  • citation_count: total citations in the S2 graph
  • influential_citation_count: citations that Semantic Scholar classifies as highly influential, meaning this paper had a significant impact on the citing work
  • hub_score: eigenvector centrality in the local citation graph (identifies papers cited by many other important papers)
  • references: list of arxiv_ids this paper cites
  • citations: list of arxiv_ids that cite this paper

The hub score is the key signal for identifying foundational papers: a paper with a high hub score is not just popular — it is cited by other influential papers, making it structurally important to the field.
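
To make "cited by other influential papers" concrete, here is a small sketch of a centrality computation over a toy citation graph. It is not the code in semantic_scholar.py. Plain eigenvector centrality collapses to zero on an acyclic citation graph, so this sketch swaps in a Katz-style variant in which every paper gets a constant baseline score; the function name, the alpha and beta parameters, and the toy graph are all assumptions for illustration.

```python
def katz_scores(cites, alpha=0.5, beta=1.0, iterations=50):
    """Score papers by who cites them.

    cites maps each paper to the list of papers it cites. A paper's score
    is beta plus alpha times the summed scores of the papers citing it.
    On graphs with cycles, alpha must be small enough for convergence.
    """
    papers = set(cites) | {p for refs in cites.values() for p in refs}
    scores = {p: beta for p in papers}
    for _ in range(iterations):
        new_scores = {p: beta for p in papers}
        for citing, refs in cites.items():
            for cited in refs:
                new_scores[cited] += alpha * scores[citing]
        scores = new_scores
    return scores

# Toy graph: A and B cite C; C and E cite D.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["D"]}
scores = katz_scores(toy_graph)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, scores[paper])
```

C and D are each cited twice, but D ranks higher because one of its citations comes from C, which is itself well cited. That is the structural signal the hub score is meant to capture.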

# Gap candidates: papers with high hub score but low direct citation count
# These are referenced by important work but not widely read
python semantic_scholar.py arxiv_agentic_papers.csv --fetch-gaps 20
# → fetches 20 gap candidate papers not yet in the CSV and appends them
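
One way such gap candidates could be ranked is sketched below. The actual logic behind --fetch-gaps is not shown in this reading, so everything here is an assumption: the function name, the idea of weighting each missing reference by the hub_score of the corpus papers that cite it, and the toy data.

```python
from collections import defaultdict

def gap_candidates(corpus, limit):
    """Rank referenced papers that are missing from the corpus.

    corpus is a list of dicts with arxiv_id, hub_score, and references.
    Each missing reference accumulates the hub_score of every corpus
    paper that cites it (an assumed weighting, for illustration).
    """
    in_corpus = {paper["arxiv_id"] for paper in corpus}
    weight = defaultdict(float)
    for paper in corpus:
        for ref in paper["references"]:
            if ref not in in_corpus:
                weight[ref] += paper["hub_score"]
    # Sort by descending weight, then by ID for a stable order.
    ranked = sorted(weight, key=lambda ref: (-weight[ref], ref))
    return ranked[:limit]

# Hypothetical corpus; X, Y, and Z are cited but not yet in the CSV.
corpus = [
    {"arxiv_id": "P1", "hub_score": 0.9, "references": ["X", "P2"]},
    {"arxiv_id": "P2", "hub_score": 0.5, "references": ["X", "Y"]},
    {"arxiv_id": "P3", "hub_score": 0.1, "references": ["Y", "Z"]},
]
candidates = gap_candidates(corpus, limit=2)
print(candidates)
```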

Stage 3: 5-Persona Curation

curator.py runs 5 simulated reviewer personas over each abstract, deciding whether to include the paper in the curated set:

import ollama

PERSONAS = [
    "ML practitioner building production agents",
    "Academic researcher studying multi-agent coordination",
    "Security engineer assessing prompt injection risks",
    "ML engineer focused on inference optimization",
    "Technical writer surveying the field"
]

def curate_paper(abstract, personas, model):
    """Return True if a strict majority of personas votes to keep the paper."""
    votes = []
    for persona in personas:
        prompt = (
            f"You are reviewing in this role: {persona}. "
            f"Evaluate the following paper abstract for relevance.\n"
            f"Abstract: {abstract}\n\n"
            f"Is this paper relevant to your work? Respond YES or NO with a one-sentence reason."
        )
        response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        answer = response["message"]["content"].strip()
        # Case-insensitive match so "Yes" and "yes" also count as a YES vote
        votes.append(answer.upper().startswith("YES"))

    # Keep on a strict majority (3 of 5 with the personas above)
    return sum(votes) * 2 > len(personas)

The curation log (curation_log.jsonl) records each vote with reason, enabling retrospective analysis of which papers were controversial and why.
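
A retrospective pass over the log might look like the sketch below, which flags papers whose vote was nearly split. The record layout (arxiv_id, persona, vote, reason) is an assumption, since the reading does not specify the JSONL schema, and controversial_papers and make_line are hypothetical helpers.

```python
import json
from collections import defaultdict

def controversial_papers(log_lines, margin=1):
    """Return IDs of papers whose yes/no split is within `margin` votes."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for line in log_lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        total[record["arxiv_id"]] += 1
        yes[record["arxiv_id"]] += 1 if record["vote"] else 0
    return sorted(
        pid for pid in total
        if abs(yes[pid] - (total[pid] - yes[pid])) <= margin
    )

def make_line(arxiv_id, persona, vote):
    """Build one log line in the assumed schema (illustrative data only)."""
    return json.dumps({"arxiv_id": arxiv_id, "persona": persona,
                       "vote": vote, "reason": "example reason"})

split_votes = [True, True, True, False, False]
log_lines = [make_line("2401.00001", i, v) for i, v in enumerate(split_votes)]
log_lines += [make_line("2406.00002", i, True) for i in range(5)]

flagged = controversial_papers(log_lines)
print(flagged)
```

Because the function only iterates over lines, an open file handle for curation_log.jsonl can be passed in place of the in-memory list.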

Stage 4: Annotation

python annotate_abstracts.py arxiv_agentic_papers.csv \
    --model nanda-annotator \
    --out annotated/

Each abstract is annotated using the Nanda 8-move framework (from the previous reading). For high-priority papers (top hub score or >100 citations), the annotation additionally passes through the Wiggum loop:

python agent.py "/annotate /wiggum https://arxiv.org/abs/2308.04079 output.md"

The combined /annotate /wiggum invocation produces an annotation and then evaluates it for quality — catching cases where the annotator misidentifies moves or misses key claims.
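
The high-priority rule (top hub score or more than 100 citations) can be sketched as a simple filter. The reading does not say how many papers count as "top hub score", so the top_k cutoff is an assumption, as are the function name and the sample rows.

```python
def high_priority(papers, top_k=10, min_citations=100):
    """Select papers in the top_k by hub_score or above min_citations."""
    by_hub = sorted(papers, key=lambda p: p["hub_score"], reverse=True)
    top_ids = {p["arxiv_id"] for p in by_hub[:top_k]}
    return [
        p for p in papers
        if p["arxiv_id"] in top_ids or p["citation_count"] > min_citations
    ]

# Hypothetical enriched rows.
papers = [
    {"arxiv_id": "A", "hub_score": 0.9, "citation_count": 12},
    {"arxiv_id": "B", "hub_score": 0.2, "citation_count": 450},
    {"arxiv_id": "C", "hub_score": 0.1, "citation_count": 8},
]
selected = high_priority(papers, top_k=1)
print([p["arxiv_id"] for p in selected])
```

Restricting the Wiggum loop to this subset keeps the extra evaluation cost proportional to the number of influential papers rather than to the whole corpus.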

Stage 5: Synthesis and Rendering

After curation and annotation, the pipeline clusters the curated papers by topic (using ChromaDB embeddings over annotation facts) and synthesizes a section for each cluster:

def synthesize_cluster(cluster_papers, template_type, producer_model):
    annotations = "\n\n".join(
        format_annotation(p) for p in cluster_papers
    )
    prompt = CLUSTER_SYNTHESIS_PROMPT.format(
        template_type=template_type,
        annotations=annotations
    )
    return call_producer(prompt, producer_model)

The final document is rendered through a Jinja2 template:

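Survey template (templates/lit_review_survey.j2): an academic-style survey with introduction, methodology, thematic sections, comparison tables, and conclusion. It suits cases where the goal is a broad overview of what the field has established.

Gaps template (templates/lit_review_gaps.j2): a gap analysis covering what exists, what is missing, which papers to read next, and open research questions. It suits cases where the goal is to decide what to read or investigate next.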
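
A minimal sketch of the rendering step, using an inline template in place of the real .j2 files. The template text, the section fields (heading, body), and the sample values are assumptions for illustration; the actual templates are not shown in this reading.

```python
from jinja2 import Environment

# Stand-in for a template file; the real templates are more elaborate.
SURVEY_TEMPLATE = """\
{{ title }}

{% for section in sections %}
{{ loop.index }}. {{ section.heading }}
{{ section.body }}

{% endfor %}
"""

# trim_blocks removes the newline that follows each block tag.
env = Environment(trim_blocks=True)
document = env.from_string(SURVEY_TEMPLATE).render(
    title="Survey: agentic LLM harness engineering",
    sections=[
        {"heading": "Tool use", "body": "Synthesized text for cluster 1."},
        {"heading": "Evaluation", "body": "Synthesized text for cluster 2."},
    ],
)
print(document)
```

Because the synthesized sections are passed in as plain data, switching between survey and gap output only requires loading a different template; the upstream stages stay the same.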

# Run the full pipeline via /lit-review skill:
python agent.py "/lit-review agentic LLM harness engineering save to review.md"

# With gap focus:
python agent.py "/lit-review --template gaps prompt injection save to gaps.md"

# Using existing CSV (skip fetch):
python agent.py "/lit-review --csv arxiv_agentic_papers.csv --no-fetch \
  --template survey agentic LLM save to survey.md"

What the Pipeline Produces

For a corpus of 300 papers on "agentic LLM harness engineering":

  • After S2 enrichment: citation metadata + hub scores for all 300 papers
  • After curation: ~200 papers pass (majority vote)
  • After annotation: ~200 structured Nanda 8-move annotations
  • After synthesis: 10–15 topical sections, ~8,000–12,000 words
  • Final document: a peer-review-quality survey or gap analysis

The whole pipeline, running locally with cached search results, takes approximately 3–4 hours for 300 papers.