Harness Engineering for AI Agents · Self-Improvement

QLoRA Fine-Tuning

By the end of this reading you will be able to:

Describe the Nanda 8-move annotated abstract framework and explain why a fine-tuned, domain-specific annotator outperforms a general-purpose model for this task
Explain how QLoRA reduces memory requirements compared to full fine-tuning and identify the key hyperparameters in finetune_annotate.py
Trace the pipeline from raw annotated abstracts through build_finetune_from_annotations.py to the nanda-annotator Ollama model

Why Domain Fine-Tuning?

The harness uses a QLoRA-fine-tuned model, nanda-annotator, to annotate academic paper abstracts with the Nanda 8-move framework. Why fine-tune rather than use the general-purpose producer model?

Three reasons:

Consistency. The Nanda framework requires applying 8 specific moves to each abstract in a prescribed order. A general-purpose model applies them inconsistently across abstracts: the move order changes, moves go missing, and the depth per move varies. A fine-tuned model has seen hundreds of examples and applies the framework the same way every time.
Efficiency. The annotation task is simple enough that a fine-tuned 7B model outperforms a 32B general-purpose model on it. Fine-tuning specializes the model for the annotation distribution.
Speed. Annotating 445 papers at 7B scale takes a fraction of the time it would take at 32B scale.

The Nanda 8-Move Framework

The Nanda Annotated Abstract is an analytical framework for evaluating ML research papers. The 8 moves are:

Move	What it identifies
1. Claim	The paper's central claim or contribution
2. Evidence	What evidence is provided
3. Method	How the claim is established
4. Scope	What the claim applies to
5. Limitation	What the paper acknowledges it cannot do
6. Comparison	How it compares to prior work
7. Implication	What follows if the claim is true
8. Novelty	What is genuinely new

Applied to a paper abstract, this framework produces a structured 8-field JSON annotation that can be aggregated, filtered, and searched across a corpus of papers.
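
As a rough illustration of that structure, the sketch below treats one annotation as a dictionary with eight fields, flags missing or blank moves, and searches a small corpus by keyword. The field names (lowercased move names), the helper functions, and the sample IDs and text are all assumptions made for this example; the harness's actual schema is not shown in this reading.

```python
import json

# Assumed field names: the eight moves, lowercased, in framework order.
NANDA_MOVES = [
    "claim", "evidence", "method", "scope",
    "limitation", "comparison", "implication", "novelty",
]


def missing_moves(annotation):
    """Return the moves that are absent or blank in one annotation dict."""
    missing = []
    for move in NANDA_MOVES:
        value = annotation.get(move)
        if not isinstance(value, str) or not value.strip():
            missing.append(move)
    return missing


def search_corpus(annotations, move, keyword):
    """Return the IDs whose `move` field contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        paper_id
        for paper_id, annotation in annotations.items()
        if needle in str(annotation.get(move) or "").lower()
    ]


# Hypothetical corpus with invented IDs and text, for illustration only.
corpus = {
    "0000.00001": {
        "claim": "Tool-augmented agents complete more multi-step tasks.",
        "evidence": "Success rates on two agent benchmarks.",
        "method": "Controlled comparison with and without tool access.",
        "scope": "Text-only web navigation tasks.",
        "limitation": "No evaluation on multimodal tasks.",
        "comparison": "Improves on a prompting-only baseline.",
        "implication": "Tool access may matter more than model size.",
        "novelty": "A unified tool-calling interface.",
    },
    "0000.00002": {
        "claim": "Self-critique reduces planning errors.",
        "evidence": "Ablation on a planning benchmark.",
        "method": "Iterative critique loop.",
        "scope": "Short-horizon planning.",
        "limitation": "",  # blank field: flagged by missing_moves
        "comparison": "Compared against single-pass planning.",
        "implication": "Cheap inference-time gains.",
        # "novelty" is omitted entirely: also flagged
    },
}

# Filter: keep only annotations with all eight moves filled in.
complete_ids = [pid for pid, ann in corpus.items() if not missing_moves(ann)]

# Search: find papers whose evidence mentions a benchmark.
benchmark_hits = search_corpus(corpus, "evidence", "benchmark")

# Each annotation is plain JSON-serializable data.
serialized = json.dumps(corpus["0000.00001"])
```

Because every annotation shares the same eight keys, corpus-level operations reduce to ordinary dictionary lookups, which is the property that makes aggregation and filtering straightforward.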

The Training Dataset

The fine-tuning dataset was built in two passes:

Gold annotations: 718 abstracts manually annotated by the primary researcher, used as the high-quality training signal. The abstracts were drawn from arxiv_agentic_papers.csv and the annotations were manually reviewed.

Agent annotations: 2,400 additional abstracts annotated by the base Qwen2.5-7B model, then filtered through curator.py (a 5-persona quality filter). The curator keeps an annotation if it passes at least 3 of the 5 personas, and rejects about 30% as low quality.
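
The 3-of-5 rule is a majority vote, and a minimal sketch of that voting logic follows. The internals of curator.py are not shown in this reading, so everything here is an assumption: the five personas are stand-in heuristic functions (the real ones may well be LLM-based reviewers), and the names `count_votes` and `curate` are hypothetical.

```python
MOVES = [
    "claim", "evidence", "method", "scope",
    "limitation", "comparison", "implication", "novelty",
]


# Five hypothetical personas. Each returns True if it accepts the annotation.
def is_complete(ann):
    return all(ann.get(move, "").strip() for move in MOVES)


def has_specific_claim(ann):
    return len(ann.get("claim", "").split()) >= 5


def names_a_limitation(ann):
    return ann.get("limitation", "").strip().lower() not in {"", "none", "n/a"}


def novelty_differs_from_claim(ann):
    return ann.get("novelty", "").strip() != ann.get("claim", "").strip()


def is_concise(ann):
    return all(len(ann.get(move, "")) <= 400 for move in MOVES)


PERSONAS = [
    is_complete,
    has_specific_claim,
    names_a_limitation,
    novelty_differs_from_claim,
    is_concise,
]


def count_votes(annotation, personas=PERSONAS):
    """Count how many personas accept the annotation."""
    return sum(1 for persona in personas if persona(annotation))


def curate(annotations, personas=PERSONAS, min_votes=3):
    """Keep annotations accepted by at least `min_votes` personas."""
    return {
        paper_id: ann
        for paper_id, ann in annotations.items()
        if count_votes(ann, personas) >= min_votes
    }


# Invented examples: one fully filled annotation and one near-empty one.
strong = {move: f"Placeholder text for the {move} move of this paper." for move in MOVES}
weak = {"claim": "Good paper."}

kept = curate({"0000.00001": strong, "0000.00002": weak})
```

In this toy run the near-empty annotation collects only two votes and is dropped. Requiring a majority rather than unanimity means a single overly strict persona cannot veto an otherwise acceptable annotation.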

build_finetune_from_annotations.py merges these two sources, preferring gold annotations when available:

def build_dataset(gold_csv, agent_csv, curated_csv=None):
    """Merge gold and agent annotations into {prompt, completion} examples."""
    # Key each source by arxiv_id so that repeated IDs collapse to one row
    gold = {row['arxiv_id']: row for row in read_csv(gold_csv)}
    agent = {row['arxiv_id']: row for row in read_csv(agent_csv)}

    # Use curated_csv if available (higher quality agent annotations)
    if curated_csv:
        curated = {row['arxiv_id']: row for row in read_csv(curated_csv)}
        # Curated rows replace agent rows with the same arxiv_id;
        # agent rows with no curated counterpart are left unchanged
        agent.update(curated)

    # Merge: gold takes precedence because it is unpacked last
    merged = {**agent, **gold}

    # Convert to {prompt, completion} pairs
    examples = []
    for arxiv_id, row in merged.items():
        examples.append({
            "prompt": format_annotation_prompt(row['abstract']),
            "completion": format_annotation_output(row)
        })
    return examples

Final dataset: 3,118 examples after deduplication.
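
To make the precedence order in build_dataset concrete, the sketch below replays the same merge on small in-memory dictionaries instead of CSV files. The function name `merge_sources`, the IDs, and the `source` field are invented for illustration; only the merge order is taken from the code above.

```python
def merge_sources(gold, agent, curated=None):
    """Replay the merge order of build_dataset on plain dictionaries."""
    combined = dict(agent)        # copy so the caller's dict is not mutated
    if curated:
        combined.update(curated)  # curated rows replace matching agent rows
    return {**combined, **gold}   # gold rows replace everything they overlap


# Hypothetical rows keyed by arxiv_id; "source" records where each came from.
gold = {"0000.00001": {"source": "gold"}}
agent = {
    "0000.00001": {"source": "agent"},
    "0000.00002": {"source": "agent"},
    "0000.00003": {"source": "agent"},
}
curated = {"0000.00002": {"source": "curated"}}

merged = merge_sources(gold, agent, curated)
sources = {arxiv_id: row["source"] for arxiv_id, row in merged.items()}
```

Here `sources` maps the first ID to "gold", the second to "curated", and the third to "agent". Because every source is keyed by arxiv_id, an ID can appear only once in the merged result, so deduplication falls out of the merge itself. Note also that an agent row with no curated counterpart survives the merge unchanged.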