Harness Engineering for AI Agents · Self-Improvement

The Literature Review Pipeline


By the end of this reading you will be able to:

Trace the literature review pipeline from arXiv fetch through Semantic Scholar enrichment, 5-persona curation, annotation, and Jinja2 rendering
Explain how Semantic Scholar hub scores identify high-influence papers and how gap candidates are identified from citation graph analysis
Distinguish the survey template from the gaps template and explain when each is appropriate

The Pipeline

The literature review pipeline is exposed as the /lit-review skill, which automates the full lifecycle of a systematic literature review, from paper discovery through the final rendered document. It produces either a comprehensive survey or a gap analysis.

arxiv_fetch.py
  → semantic_scholar.py   (citation enrichment)
    → curator.py          (5-persona quality filter)
      → annotate_abstracts.py  (Nanda 8-move annotation + Wiggum)
        → synthesize()    (cluster → synthesize across papers)
          → Jinja2 render (survey or gaps template)

Stage 1: arXiv Fetch

# Fetch papers matching a query:
python arxiv_fetch.py "agentic LLM harness engineering" --max 300
# → arxiv_agentic_llm_harness.csv (up to 300 papers)

# With date filter (for incremental updates):
python arxiv_fetch.py "prompt injection" --after 2024-06-01 --append existing.csv

# Inspect existing dataset:
python arxiv_fetch.py --stats arxiv_agentic_papers.csv

The CSV schema: arxiv_id, title, authors, published, abstract, url. Deduplication by arxiv_id prevents adding the same paper twice across incremental fetches.
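The deduplication step can be sketched as follows. This is a minimal illustration, not the code in arxiv_fetch.py: the function names merge_papers and normalize_id are hypothetical, and stripping the version suffix (so that 0000.00001 and 0000.00001v2 count as the same paper) is an assumption about how IDs are compared.

```python
import re

def normalize_id(arxiv_id):
    """Strip whitespace and a trailing version suffix such as 'v2' (assumed behavior)."""
    return re.sub(r"v\d+$", "", arxiv_id.strip())

def merge_papers(existing_rows, new_rows):
    """Return existing rows plus any new rows whose arxiv_id is not already present."""
    seen = {normalize_id(row["arxiv_id"]) for row in existing_rows}
    merged = list(existing_rows)
    for row in new_rows:
        key = normalize_id(row["arxiv_id"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged

# Placeholder rows, not real papers.
existing = [{"arxiv_id": "0000.00001", "title": "Placeholder A"}]
fetched = [
    {"arxiv_id": "0000.00001v2", "title": "Placeholder A (revised)"},
    {"arxiv_id": "0000.00002", "title": "Placeholder B"},
]
merged = merge_papers(existing, fetched)
```

Keeping the first copy of a paper means an incremental fetch never overwrites rows that later stages may already have enriched.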

Stage 2: Semantic Scholar Enrichment

arXiv metadata does not include citation data, so semantic_scholar.py enriches each paper with data from the Semantic Scholar API:

python semantic_scholar.py arxiv_agentic_papers.csv

Added fields per paper:

citation_count — total citations in the S2 graph
influential_citation_count — citations that Semantic Scholar classifies as highly influential, meaning the cited paper had a significant impact on the citing paper
hub_score — eigenvector centrality in the local citation graph (identifies papers cited by many other important papers)
references — list of arxiv_ids this paper cites
citations — list of arxiv_ids that cite this paper

The hub score is the key signal for identifying foundational papers: a paper with a high hub score is not just popular — it is cited by other influential papers, making it structurally important to the field.
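To make the "cited by other influential papers" idea concrete, here is a small sketch of a centrality score over a citation graph. It is not the computation in semantic_scholar.py, which this reading does not show. It swaps in a damped, Katz-style variant, because plain eigenvector centrality degenerates on an acyclic citation graph; the function name katz_scores and the parameters alpha, base, and iterations are assumptions made for illustration.

```python
def katz_scores(references, alpha=0.5, base=1.0, iterations=50):
    """Score each paper as base + alpha * (sum of the scores of papers citing it).

    references maps a paper ID to the list of paper IDs it cites.
    """
    papers = set(references)
    for cited_list in references.values():
        papers.update(cited_list)

    scores = {paper: base for paper in papers}
    for _ in range(iterations):
        updated = {paper: base for paper in papers}
        for citing, cited_list in references.items():
            for cited in cited_list:
                updated[cited] += alpha * scores[citing]
        scores = updated
    return scores

# Toy graph: A and X each receive two citations, but the papers citing A
# (B and C) are themselves cited, while the papers citing X (Y and Z) are not.
toy_references = {
    "B": ["A"], "C": ["A"],
    "D": ["B"], "E": ["B"], "F": ["B"], "G": ["C"],
    "Y": ["X"], "Z": ["X"],
}
toy_scores = katz_scores(toy_references)
```

In this toy graph A and X have the same raw citation count, yet A ends up with the higher score, which is the distinction between "popular" and "structurally important" described above.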

# Gap candidates: papers with high hub score but low direct citation count
# These are referenced by important work but not widely read
python semantic_scholar.py arxiv_agentic_papers.csv --fetch-gaps 20
# → fetches 20 gap candidate papers not yet in the CSV and appends them
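One plausible way to pick gap candidates that are "not yet in the CSV" is to count how many corpus papers reference each out-of-corpus ID. The sketch below follows that reading only. It is an assumption about the selection rule, not the logic behind --fetch-gaps, and rank_gap_candidates is a hypothetical name.

```python
from collections import Counter

def rank_gap_candidates(papers, limit=20):
    """Rank IDs that corpus papers reference but that are missing from the corpus.

    papers is a list of dicts with 'arxiv_id' and 'references' (a list of IDs).
    """
    corpus_ids = {paper["arxiv_id"] for paper in papers}
    counts = Counter()
    for paper in papers:
        # set() so a paper that lists the same reference twice counts once.
        for ref in set(paper["references"]):
            if ref not in corpus_ids:
                counts[ref] += 1
    # Most-referenced first; ties broken by ID so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [ref for ref, _ in ranked[:limit]]

# Placeholder IDs for illustration.
corpus = [
    {"arxiv_id": "p1", "references": ["g1", "g2", "p2"]},
    {"arxiv_id": "p2", "references": ["g1"]},
    {"arxiv_id": "p3", "references": ["g1", "g2"]},
]
candidates = rank_gap_candidates(corpus, limit=20)
```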

Stage 3: 5-Persona Curation

curator.py runs 5 simulated reviewer personas over each abstract, deciding whether to include the paper in the curated set:

import ollama

PERSONAS = [
    "ML practitioner building production agents",
    "Academic researcher studying multi-agent coordination",
    "Security engineer assessing prompt injection risks",
    "ML engineer focused on inference optimization",
    "Technical writer surveying the field"
]

def curate_paper(abstract, personas, model):
    """Return True if a majority of personas judge the abstract relevant."""
    votes = []
    for persona in personas:
        prompt = (
            f"You are a {persona} evaluating a paper abstract for relevance. "
            f"Abstract: {abstract}\n\n"
            f"Is this paper relevant to your work? Respond YES or NO with a one-sentence reason."
        )
        response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        # Compare case-insensitively so "Yes" and "yes" also count as a YES vote
        answer = response["message"]["content"].strip().upper()
        votes.append(answer.startswith("YES"))

    # Keep if majority vote (3/5 or better)
    return sum(votes) >= 3

The curation log (curation_log.jsonl) records each vote with reason, enabling retrospective analysis of which papers were controversial and why.
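A sketch of how such a log could be written and then mined for controversial papers follows. The record fields (arxiv_id, persona, vote, reason) and the function names are assumptions, since the reading does not show the actual schema of curation_log.jsonl. "Controversial" is taken here to mean a one-vote margin, such as 3 to 2.

```python
import json
from collections import defaultdict

def parse_vote(response_text):
    """Split a 'YES ...' or 'NO ...' reply into (vote, reason)."""
    parts = response_text.strip().split(None, 1)
    if not parts:
        return False, ""
    vote = parts[0].strip(".,:;!-").upper() == "YES"
    reason = parts[1].strip() if len(parts) > 1 else ""
    return vote, reason

def log_record(arxiv_id, persona, response_text):
    """Build one JSONL line for a single persona's vote."""
    vote, reason = parse_vote(response_text)
    return json.dumps(
        {"arxiv_id": arxiv_id, "persona": persona, "vote": vote, "reason": reason}
    )

def controversial_papers(log_lines):
    """Return IDs of papers decided by a one-vote margin."""
    yes_counts = defaultdict(int)
    totals = defaultdict(int)
    for line in log_lines:
        record = json.loads(line)
        totals[record["arxiv_id"]] += 1
        yes_counts[record["arxiv_id"]] += int(record["vote"])
    return sorted(
        paper_id
        for paper_id, total in totals.items()
        if abs(2 * yes_counts[paper_id] - total) == 1
    )

# Placeholder votes for two papers: "a" splits 3 to 2, "b" is unanimous.
replies_a = ["YES. Relevant.", "Yes, useful.", "NO, off topic.", "NO.", "YES: fits."]
lines = [log_record("a", f"persona-{i}", text) for i, text in enumerate(replies_a)]
lines += [log_record("b", f"persona-{i}", "YES. Clearly relevant.") for i in range(5)]
split_decisions = controversial_papers(lines)
```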

Stage 4: Annotation

python annotate_abstracts.py arxiv_agentic_papers.csv \
    --model nanda-annotator \
    --out annotated/

Each abstract is annotated using the Nanda 8-move framework (from the previous reading). For high-priority papers (top hub score or >100 citations), the annotation additionally passes through the Wiggum loop:

python agent.py "/annotate /wiggum https://arxiv.org/abs/2308.04079 output.md"

The combined /annotate /wiggum invocation produces an annotation and then evaluates it for quality — catching cases where the annotator misidentifies moves or misses key claims.
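Choosing which papers get the extra Wiggum pass can be sketched as a simple filter. The reading gives the rule as "top hub score or >100 citations" without saying how many papers count as "top", so the top_k cutoff below is an assumption, as is the function name select_high_priority.

```python
def select_high_priority(papers, top_k=10, min_citations=100):
    """Return papers in the top_k by hub_score or with more than min_citations citations."""
    by_hub = sorted(papers, key=lambda paper: paper["hub_score"], reverse=True)
    top_hub_ids = {paper["arxiv_id"] for paper in by_hub[:top_k]}
    return [
        paper
        for paper in papers
        if paper["arxiv_id"] in top_hub_ids or paper["citation_count"] > min_citations
    ]

# Placeholder rows for illustration.
enriched = [
    {"arxiv_id": "a", "hub_score": 0.9, "citation_count": 10},
    {"arxiv_id": "b", "hub_score": 0.1, "citation_count": 500},
    {"arxiv_id": "c", "hub_score": 0.2, "citation_count": 5},
]
priority = select_high_priority(enriched, top_k=1)
```

Here "a" qualifies through its hub score and "b" through its citation count, while "c" receives only the standard annotation.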

Stage 5: Synthesis and Rendering

After curation and annotation, the pipeline clusters the curated papers by topic (using ChromaDB embeddings over annotation facts) and synthesizes a section for each cluster:

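def synthesize_cluster(cluster_papers, template_type, producer_model):
    """Synthesize one section of the review from a cluster of annotated papers."""
    # format_annotation, CLUSTER_SYNTHESIS_PROMPT, and call_producer are
    # pipeline helpers that are not shown in this reading.
    annotations = "\n\n".join(
        format_annotation(p) for p in cluster_papers
    )
    prompt = CLUSTER_SYNTHESIS_PROMPT.format(
        template_type=template_type,
        annotations=annotations
    )
    return call_producer(prompt, producer_model)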

The final document is rendered through a Jinja2 template:

Survey template (templates/lit_review_survey.j2) — academic-style survey: introduction, methodology, thematic sections, comparison tables, conclusion.

Gaps template (templates/lit_review_gaps.j2) — gap analysis: what exists, what is missing, which papers to read next, open research questions.
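The rendering step can be sketched with Jinja2 as follows. The inline template strings are tiny stand-ins for templates/lit_review_survey.j2 and templates/lit_review_gaps.j2, whose real contents are not shown in this reading. The render_review function and its topic, sections, and open_questions variables are assumptions made for illustration.

```python
from jinja2 import DictLoader, Environment

# Stand-in templates; the real .j2 files are assumed to be much richer.
TEMPLATES = {
    "survey": (
        "# {{ topic }}: literature survey\n"
        "{% for s in sections %}\n"
        "## {{ s.title }}\n"
        "{{ s.body }}\n"
        "{% endfor %}"
    ),
    "gaps": (
        "# {{ topic }}: gap analysis\n"
        "{% for s in sections %}\n"
        "## {{ s.title }}\n"
        "{{ s.body }}\n"
        "{% endfor %}\n"
        "## Open research questions\n"
        "{% for q in open_questions %}\n"
        "- {{ q }}\n"
        "{% endfor %}"
    ),
}

env = Environment(loader=DictLoader(TEMPLATES), trim_blocks=True, lstrip_blocks=True)

def render_review(template_type, topic, sections, open_questions=()):
    """Render synthesized sections through the survey or gaps template."""
    if template_type not in TEMPLATES:
        raise ValueError(f"unknown template type: {template_type!r}")
    template = env.get_template(template_type)
    return template.render(
        topic=topic, sections=sections, open_questions=list(open_questions)
    )

sections = [{"title": "Planning", "body": "Synthesized text for this cluster."}]
survey_doc = render_review("survey", "Agentic LLMs", sections)
gaps_doc = render_review("gaps", "Agentic LLMs", sections, ["What is missing?"])
```

Because both templates consume the same synthesized sections, switching between a survey and a gap analysis only changes the final rendering step.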

# Run the full pipeline via /lit-review skill:
python agent.py "/lit-review agentic LLM harness engineering save to review.md"

# With gap focus:
python agent.py "/lit-review --template gaps prompt injection save to gaps.md"

# Using existing CSV (skip fetch):
python agent.py "/lit-review --csv arxiv_agentic_papers.csv --no-fetch \
  --template survey agentic LLM save to survey.md"

What the Pipeline Produces

For a corpus of 300 papers on "agentic LLM harness engineering":

After S2 enrichment: citation metadata + hub scores for all 300 papers
After curation: ~200 papers pass (majority vote)
After annotation: ~200 structured Nanda 8-move annotations
After synthesis: 10–15 topical sections, ~8,000–12,000 words
Final document: a peer-review-quality survey or gap analysis

The whole pipeline, running locally with cached search results, takes approximately 3–4 hours for 300 papers.

References

Semantic Scholar API — Semantic Scholar Open Research Corpus
