Harness Engineering for AI Agents · Self-Improvement

QLoRA Fine-Tuning

15 min read
By the end of this reading you will be able to:
  • Describe the Nanda 8-move annotated abstract framework and explain why fine-tuning a domain-specific annotator outperforms a general-purpose model for this task
  • Explain how QLoRA reduces memory requirements compared to full fine-tuning and identify the key hyperparameters in finetune_annotate.py
  • Trace the pipeline from raw annotated abstracts through build_finetune_from_annotations.py to the nanda-annotator Ollama model

Why Domain Fine-Tuning?

The harness uses a QLoRA-fine-tuned model, nanda-annotator, for annotating academic paper abstracts using the Nanda 8-move framework. Why fine-tune rather than use the general-purpose producer model?

Three reasons:

  1. Consistency. The Nanda framework requires applying 8 specific moves to each abstract in a prescribed order. A general-purpose model applies these inconsistently across abstracts — different move ordering, missing moves, variable depth per move. A fine-tuned model has seen hundreds of examples and applies the framework consistently.

  2. Efficiency. The annotation task is narrow enough that a fine-tuned 7B model outperforms a 32B general-purpose model on it, because fine-tuning specializes the model for the annotation distribution.

  3. Speed. Annotating 445 papers with a 7B model takes a fraction of the time a 32B model would need.

The Nanda 8-Move Framework

The Nanda Annotated Abstract is an analytical framework for evaluating ML research papers. The 8 moves are:

Move             What it identifies
1. Claim         The paper's central claim or contribution
2. Evidence      What evidence is provided
3. Method        How the claim is established
4. Scope         What the claim applies to
5. Limitation    What the paper acknowledges it cannot do
6. Comparison    How it compares to prior work
7. Implication   What follows if the claim is true
8. Novelty       What is genuinely new

Applied to a paper abstract, this framework produces a structured 8-field JSON annotation that can be aggregated, filtered, and searched across a corpus of papers.
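
To make the "structured 8-field JSON" idea concrete, here is a minimal sketch of a validator for one annotation. The field names are an assumption (lowercase versions of the move names in the table above); the harness's real schema may differ.

```python
import json

# Assumed field names: lowercase versions of the eight move names.
MOVES = ["claim", "evidence", "method", "scope",
         "limitation", "comparison", "implication", "novelty"]

def validate_annotation(raw: str) -> dict:
    """Parse a JSON annotation and check it has exactly the 8 moves, in order."""
    annotation = json.loads(raw)
    if list(annotation.keys()) != MOVES:
        raise ValueError(f"expected fields {MOVES}, got {list(annotation.keys())}")
    # Every move must carry some non-blank text.
    empty = [m for m in MOVES if not str(annotation[m]).strip()]
    if empty:
        raise ValueError(f"empty moves: {empty}")
    return annotation

# Toy annotation with placeholder text for each move.
example = json.dumps({m: f"placeholder text for {m}" for m in MOVES})
annotation = validate_annotation(example)
```

A check of this kind is what makes the annotations safe to aggregate: every record that passes has the same keys in the same order.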

The Training Dataset

The fine-tuning dataset was built in two passes:

Gold annotations — 718 abstracts manually annotated by the primary researcher, used as the high-quality training signal. These were collected from arxiv_agentic_papers.csv and manually reviewed.

Agent annotations — 2,400 additional abstracts annotated by the base Qwen2.5-7B model, then filtered through curator.py (5-persona quality filter). The curator keeps annotations that pass 3 of 5 personas, rejecting ~30% as low quality.
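
The 3-of-5 rule can be sketched as a simple vote count. This is an illustration only: the persona functions below are toy stand-ins, and the actual checks inside curator.py are not shown in this reading.

```python
from typing import Callable, Sequence

# Hypothetical persona type: takes an annotation, returns True if it passes.
Persona = Callable[[dict], bool]

def passes_curation(annotation: dict, personas: Sequence[Persona],
                    min_votes: int = 3) -> bool:
    """Keep an annotation when at least min_votes personas accept it."""
    votes = sum(1 for persona in personas if persona(annotation))
    return votes >= min_votes

# Five toy personas (assumptions, not the real curator checks).
personas = [
    lambda a: bool(a.get("claim")),
    lambda a: bool(a.get("evidence")),
    lambda a: bool(a.get("method")),
    lambda a: len(a.get("claim", "")) > 10,
    lambda a: a.get("novelty") != a.get("claim"),
]

good = {"claim": "Agents improve with self-play", "evidence": "3 benchmarks",
        "method": "RL", "novelty": "new curriculum"}
borderline = {"claim": "short", "evidence": "e", "method": "", "novelty": "n"}
bad = {"claim": "", "evidence": "", "method": "", "novelty": ""}

kept = [a for a in (good, borderline, bad) if passes_curation(a, personas)]
```

Here `good` collects five votes, `borderline` exactly three, and `bad` none, so only the first two are kept.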

build_finetune_from_annotations.py merges these two sources, preferring gold annotations when available:

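def build_dataset(gold_csv, agent_csv, curated_csv=None):
    """Merge gold and agent annotations into {prompt, completion} training pairs.

    read_csv, format_annotation_prompt, and format_annotation_output are helpers not shown here.
    """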
    gold = {row['arxiv_id']: row for row in read_csv(gold_csv)}
    agent = {row['arxiv_id']: row for row in read_csv(agent_csv)}

    # Use curated_csv if available (higher quality agent annotations)
    if curated_csv:
        curated = {row['arxiv_id']: row for row in read_csv(curated_csv)}
        agent.update(curated)  # curated overrides unfiltered agent annotations

    # Merge: gold takes precedence
    merged = {**agent, **gold}

    # Convert to {prompt, completion} pairs
    examples = []
    for arxiv_id, row in merged.items():
        examples.append({
            "prompt": format_annotation_prompt(row['abstract']),
            "completion": format_annotation_output(row)
        })
    return examples

Final dataset: 3,118 examples after deduplication.
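
The precedence rule in the merge, and the reason keying on arxiv_id also deduplicates, can be seen on toy data. The IDs and rows below are made up for illustration.

```python
# Toy rows keyed by arxiv_id; only a "source" label is kept to track precedence.
agent = {"2401.00001": {"source": "agent"}, "2401.00002": {"source": "agent"}}
curated = {"2401.00002": {"source": "curated"}}
gold = {"2401.00001": {"source": "gold"}, "2401.00003": {"source": "gold"}}

agent.update(curated)        # curated replaces the unfiltered agent row
merged = {**agent, **gold}   # later dicts win, so gold overrides both

sources = {arxiv_id: row["source"] for arxiv_id, row in merged.items()}
# sources == {"2401.00001": "gold", "2401.00002": "curated", "2401.00003": "gold"}
```

Five input rows collapse to three examples, one per arxiv_id, with the highest-priority source surviving for each.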

QLoRA Fine-Tuning

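QLoRA (Quantized Low-Rank Adaptation) fine-tunes a quantized base model by adding trainable low-rank adapter matrices to the attention layers. The base model weights are stored in 4-bit precision and remain frozen; only the adapter weights are trained. This cuts memory requirements substantially: the 7B weights alone occupy ~14 GB in bf16, before counting the gradients and optimizer states that full fine-tuning adds, whereas the QLoRA setup needs ~6 GB.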
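
A back-of-envelope calculation shows where memory figures of this size come from. It is a rough sketch: it uses a nominal 7e9 parameters, counts weights only, and the 16 bytes per parameter for full fine-tuning is a common rule of thumb for mixed-precision Adam, not a measurement from this harness.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N = 7e9  # nominal parameter count for a "7B" model (assumption)

bf16_weights = weight_gb(N, 16)  # 14.0 GB: weights stored in bfloat16
nf4_weights = weight_gb(N, 4)    # 3.5 GB: weights stored in 4-bit NF4

# Rule of thumb for full fine-tuning with mixed-precision Adam:
# weights + gradients + fp32 master copy + two optimizer moments.
full_finetune_estimate = N * 16 / 1e9  # 112.0 GB
```

The gap between the 3.5 GB of 4-bit weights and a total near 6 GB is taken up by adapter weights, their optimizer state, activations, and quantization constants.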

Key hyperparameters in finetune_annotate.py:

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization of base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # NormalFloat4 — better than int4 for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True   # quantize the quantization constants too
)

# LoRA config
lora_config = LoraConfig(
    r=16,              # rank — higher = more capacity, more memory
    lora_alpha=32,     # scaling factor: the adapter update is scaled by lora_alpha / r (2.0 here)
    target_modules=["q_proj", "v_proj"],  # which attention matrices to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
model = get_peft_model(base_model, lora_config)
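
To get a feel for how small the trainable part is, the adapter size for this configuration can be estimated by hand. The layer shapes below are assumptions taken from the published Qwen2.5-7B configuration (hidden size 3584, 28 layers, 4 key/value heads of dimension 128); calling model.print_trainable_parameters() on the PEFT model reports the actual count.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * (d_in + d_out)

# Assumed Qwen2.5-7B shapes; check against the model's config.json.
HIDDEN = 3584      # q_proj maps HIDDEN -> HIDDEN
KV_DIM = 4 * 128   # v_proj maps HIDDEN -> 512 (grouped-query attention)
LAYERS = 28
R, ALPHA = 16, 32

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * LAYERS   # 5,046,272 trainable parameters
scaling = ALPHA / R          # 2.0: multiplier applied to the adapter update
```

Under these assumptions the adapters hold about 5 million parameters, well under 0.1% of the base model, which is why gradients and optimizer state stay cheap.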

Training:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="finetune_output/",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,
    bf16=True,                        # bfloat16 for NVIDIA 30xx/40xx GPUs
    save_steps=100,                   # checkpoint every 100 steps
    logging_steps=10,
    report_to="none"
)

# Run fine-tuning:
python finetune_annotate.py

# Resume from a saved checkpoint:
python finetune_annotate.py --resume finetune_output/checkpoint-300
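
The batch settings determine how many optimizer steps a run takes and which checkpoints exist to resume from. The sketch below assumes a single GPU and that all 3,118 examples are used for training; the Trainer's own rounding may differ by a step.

```python
import math

def training_steps(n_examples: int, per_device_batch: int, grad_accum: int,
                   epochs: int, n_devices: int = 1):
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

effective_batch, steps_per_epoch, total_steps = training_steps(3118, 4, 4, 3)
# effective_batch == 16, steps_per_epoch == 195, total_steps == 585

# With save_steps=100, checkpoints land at these step counts.
checkpoints = list(range(100, total_steps + 1, 100))
# checkpoints == [100, 200, 300, 400, 500]
```

Under these assumptions, checkpoint-300 sits a little past the midpoint of the run, early in the second epoch.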

Exporting to Ollama

After training, the adapter is merged into the base model and exported as GGUF (the format Ollama uses):

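# Merge adapter into base model (the tokenizer is saved too, because the GGUF converter reads it):
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct')
model = PeftModel.from_pretrained(base, 'finetune_output/')
merged = model.merge_and_unload()
merged.save_pretrained('finetune_output/merged/')
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct').save_pretrained('finetune_output/merged/')
"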

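# Convert to GGUF with llama.cpp (the converter has no q4_K_M output type, so write f16 first):
python llama.cpp/convert_hf_to_gguf.py finetune_output/merged/ \
  --outfile finetune_output/nanda-annotator-f16.gguf --outtype f16

# Quantize to Q4_K_M (the binary's path depends on how llama.cpp was built):
./llama.cpp/build/bin/llama-quantize finetune_output/nanda-annotator-f16.gguf \
  finetune_output/nanda-annotator.gguf Q4_K_M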

# Create Ollama model:
ollama create nanda-annotator -f finetune_output/Modelfile

The finetune_output/Modelfile specifies the GGUF path and custom stop tokens that match the annotation output format:

FROM ./nanda-annotator.gguf

PARAMETER stop "<|im_end|>"
PARAMETER stop "</annotation>"
PARAMETER temperature 0.1
PARAMETER num_ctx 4096

Once created, nanda-annotator is invokable like any Ollama model:

python agent.py "/annotate Survey of RAG techniques save to output.md"
# → auto-activates /annotate skill
# → uses nanda-annotator for annotation if available in registry
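
The model can also be called without going through agent.py, using Ollama's local REST API. This is a minimal sketch, assuming Ollama is serving on its default port; sending the raw abstract as the prompt is a placeholder, since the harness builds its prompt with format_annotation_prompt, which is not shown here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(abstract: str, model: str = "nanda-annotator") -> urllib.request.Request:
    """Build a non-streaming generate request for the given model."""
    payload = {"model": model, "prompt": abstract, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def annotate(abstract: str) -> str:
    """Send the abstract to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(abstract), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling annotate("...abstract text...") requires a running Ollama server with nanda-annotator created; the temperature, context size, and stop tokens from the Modelfile apply automatically.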