QLoRA Fine-Tuning
- Describe the Nanda 8-move annotated abstract framework and explain why fine-tuning a domain-specific annotator outperforms a general-purpose model for this task
- Explain how QLoRA reduces memory requirements compared to full fine-tuning and identify the key hyperparameters in finetune_annotate.py
- Trace the pipeline from raw annotated abstracts through build_finetune_from_annotations.py to the nanda-annotator Ollama model
Why Domain Fine-Tuning?
The harness uses a QLoRA-fine-tuned model, nanda-annotator, for annotating academic paper abstracts using the Nanda 8-move framework. Why fine-tune rather than use the general-purpose producer model?
Three reasons:
Consistency. The Nanda framework requires applying 8 specific moves to each abstract in a prescribed order. A general-purpose model applies them inconsistently across abstracts: the move order changes, moves go missing, and the depth per move varies. A fine-tuned model has been trained on thousands of examples and applies the framework uniformly.
Efficiency. The annotation task is narrow enough that a fine-tuned 7B model outperforms a 32B general-purpose model on it, because fine-tuning specializes the model for the annotation distribution.
Speed. Annotating 445 papers at 7B scale takes a fraction of the time it would take at 32B scale.
The Nanda 8-Move Framework
The Nanda Annotated Abstract is an analytical framework for evaluating ML research papers. The 8 moves are:
| Move | What it identifies |
|---|---|
| 1. Claim | The paper's central claim or contribution |
| 2. Evidence | What evidence is provided |
| 3. Method | How the claim is established |
| 4. Scope | What the claim applies to |
| 5. Limitation | What the paper acknowledges it cannot do |
| 6. Comparison | How it compares to prior work |
| 7. Implication | What follows if the claim is true |
| 8. Novelty | What is genuinely new |
Applied to a paper abstract, this framework produces a structured 8-field JSON annotation that can be aggregated, filtered, and searched across a corpus of papers.
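A minimal sketch of how such an annotation might be represented and validated, assuming lowercase move names as the JSON keys. The class `NandaAnnotation` and the helpers `parse_annotation` and `filter_by_keyword` are hypothetical names for illustration, not the harness's actual schema.

```python
import json
from dataclasses import dataclass, asdict, fields


@dataclass
class NandaAnnotation:
    """One annotated abstract. Field names are illustrative assumptions."""
    claim: str
    evidence: str
    method: str
    scope: str
    limitation: str
    comparison: str
    implication: str
    novelty: str


def parse_annotation(raw: str) -> NandaAnnotation:
    """Parse a JSON string and fail loudly if any of the 8 moves is missing or empty."""
    data = json.loads(raw)
    expected = [f.name for f in fields(NandaAnnotation)]
    missing = [name for name in expected if not str(data.get(name) or "").strip()]
    if missing:
        raise ValueError(f"annotation is missing moves: {missing}")
    return NandaAnnotation(**{name: data[name] for name in expected})


def filter_by_keyword(annotations, move, keyword):
    """Return the annotations whose given move mentions the keyword (case-insensitive)."""
    return [a for a in annotations if keyword.lower() in getattr(a, move).lower()]
```

Because every annotation carries the same eight fields, corpus-level operations such as filtering on `limitation` or grouping by `method` reduce to ordinary attribute lookups.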
The Training Dataset
The fine-tuning dataset was built in two passes:
Gold annotations — 718 abstracts manually annotated by the primary researcher, used as the high-quality training signal. These were collected from arxiv_agentic_papers.csv and manually reviewed.
Agent annotations — 2,400 additional abstracts annotated by the base Qwen2.5-7B model, then filtered through curator.py (5-persona quality filter). The curator keeps annotations that pass 3 of 5 personas, rejecting ~30% as low quality.
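The 3-of-5 persona filter can be sketched as a simple vote. This is an illustration only: the real personas in curator.py are not shown in this lesson, so the five checks below are hypothetical stand-ins, and "pass 3 of 5" is assumed to mean at least 3.

```python
MOVES = ["claim", "evidence", "method", "scope",
         "limitation", "comparison", "implication", "novelty"]


# Five toy "personas" (hypothetical). Each inspects an annotation dict and votes True/False.
def has_all_moves(a):
    return all(a.get(m, "").strip() for m in MOVES)


def claim_is_substantive(a):
    return len(a.get("claim", "").split()) >= 5


def limitation_is_specific(a):
    return a.get("limitation", "").strip().lower() not in {"", "none", "n/a"}


def no_field_copies_another(a):
    values = [a.get(m, "") for m in MOVES]
    return len(set(values)) == len(values)


def fields_are_concise(a):
    return all(len(a.get(m, "")) <= 400 for m in MOVES)


PERSONAS = [has_all_moves, claim_is_substantive, limitation_is_specific,
            no_field_copies_another, fields_are_concise]


def passes_curation(annotation, personas=PERSONAS, min_votes=3):
    """Keep the annotation if at least min_votes personas approve it."""
    votes = sum(1 for persona in personas if persona(annotation))
    return votes >= min_votes
```

A majority vote tolerates one or two overly strict personas while still rejecting annotations that most of them flag.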
build_finetune_from_annotations.py merges these two sources, preferring gold annotations when available:
```python
# read_csv, format_annotation_prompt, and format_annotation_output are helpers
# that are not shown here.
def build_dataset(gold_csv, agent_csv, curated_csv=None):
    gold = {row['arxiv_id']: row for row in read_csv(gold_csv)}
    agent = {row['arxiv_id']: row for row in read_csv(agent_csv)}

    # Use curated_csv if available (higher quality agent annotations)
    if curated_csv:
        curated = {row['arxiv_id']: row for row in read_csv(curated_csv)}
        # Curated rows override unfiltered agent rows with the same arxiv_id;
        # agent rows that are absent from curated_csv are still kept.
        agent.update(curated)

    # Merge: gold takes precedence
    merged = {**agent, **gold}

    # Convert to {prompt, completion} pairs
    examples = []
    for arxiv_id, row in merged.items():
        examples.append({
            "prompt": format_annotation_prompt(row['abstract']),
            "completion": format_annotation_output(row)
        })
    return examples
```
Final dataset: 3,118 examples after deduplication.
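The two formatting helpers called by `build_dataset` are not shown in this lesson. One plausible shape is sketched below, assuming the CSV has one column per move (named as in `MOVES`) and that the completion is JSON wrapped in `<annotation>` tags, which would fit the `</annotation>` stop token in the Modelfile. Both the prompt wording and the column names are assumptions.

```python
import json

# Assumed CSV column names, one per move (hypothetical).
MOVES = ["claim", "evidence", "method", "scope",
         "limitation", "comparison", "implication", "novelty"]


def format_annotation_prompt(abstract: str) -> str:
    """Build the instruction the model sees for one abstract."""
    return (
        "Annotate the following abstract using the Nanda 8-move framework.\n"
        f"Return a JSON object with the keys: {', '.join(MOVES)}.\n\n"
        f"Abstract: {abstract.strip()}\n"
    )


def format_annotation_output(row: dict) -> str:
    """Serialize the 8 moves of one CSV row as the target completion."""
    annotation = {move: row[move] for move in MOVES}
    return "<annotation>" + json.dumps(annotation, ensure_ascii=False) + "</annotation>"
```

Whatever the real format is, the prompt used at inference time has to match the one used here, since the fine-tuned model has only seen that template.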
QLoRA Fine-Tuning
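QLoRA (Quantized Low-Rank Adaptation) fine-tunes a quantized base model by adding trainable low-rank adapter matrices to selected linear layers, here the attention projections. The base model weights remain frozen in 4-bit form; only the adapter weights are trained. The weights of a 7B model alone occupy ~14 GB in 16-bit precision, and full fine-tuning needs gradients and optimizer state for every parameter on top of that. With QLoRA the memory requirement drops to ~6 GB.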
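The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below uses a nominal 7e9 parameters and assumes the Qwen2.5-7B layer shapes (hidden size 3584, key/value projection width 512, 28 layers); treat those shapes as assumptions to verify against the model's config. It counts weights only, not activations or quantization overhead.

```python
GB = 1e9


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8 / GB


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A LoRA adapter adds A (r x in) and B (out x r) next to one linear layer."""
    return r * (in_features + out_features)


n_params = 7e9                            # nominal size of a "7B" model
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit weights, ignoring quantization constants

# Assumed Qwen2.5-7B shapes; r=16 on q_proj and v_proj as in the LoRA config.
hidden, kv_dim, n_layers, r = 3584, 512, 28, 16
per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)
trainable = per_layer * n_layers

print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
print(f"trainable LoRA parameters: {trainable:,} "
      f"({100 * trainable / n_params:.3f}% of the base model)")
```

Under these assumptions the adapters amount to roughly five million trainable parameters, so the gradients and optimizer state that dominate full fine-tuning become negligible next to the frozen 4-bit weights.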
Key hyperparameters in finetune_annotate.py:
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization of base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4: better than int4 for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls after dequantization
    bnb_4bit_use_double_quant=True          # quantize the quantization constants too
)

# LoRA config
lora_config = LoraConfig(
    r=16,                                  # rank: higher = more capacity, more memory
    lora_alpha=32,                         # scaling factor; the adapter update is scaled by lora_alpha / r (2.0 here)
    target_modules=["q_proj", "v_proj"],   # which attention matrices to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)
model = get_peft_model(base_model, lora_config)
```
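To make the roles of `r` and `lora_alpha` concrete, here is a tiny pure-Python illustration of the LoRA update rule, W' = W + (lora_alpha / r) · B·A, on toy matrices. It is a didactic sketch, not how peft implements it; `matmul` and `lora_effective_weight` are names invented for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, lora_alpha, r):
    """W is (out x in), A is (r x in), B is (out x r); returns W + (alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1, 0], [0, 1]]   # frozen base weight (toy 2x2)
A = [[1, 2]]           # rank-1 adapter, r = 1
B_init = [[0], [0]]    # B starts at zero, so training begins from the base model
B_trained = [[1], [0]]

print(lora_effective_weight(W, A, B_init, lora_alpha=2, r=1))
print(lora_effective_weight(W, A, B_trained, lora_alpha=2, r=1))
```

With the values from the config above (`r=16`, `lora_alpha=32`) the scale factor is 2.0, so raising `lora_alpha` at a fixed rank amplifies whatever the adapter has learned.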
Training:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="finetune_output/",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,
    bf16=True,                       # bfloat16 for NVIDIA 30xx/40xx GPUs
    save_steps=100,                  # checkpoint every 100 steps
    logging_steps=10,
    report_to="none"
)
```
```shell
# Run fine-tuning:
python finetune_annotate.py
# Resume from a saved checkpoint:
python finetune_annotate.py --resume finetune_output/checkpoint-300
```
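The checkpoint numbers follow from the batch settings. The sketch below estimates the schedule, assuming a single GPU and that all 3,118 examples are used for training with no held-out split; `training_schedule` is a hypothetical helper, and the Trainer's own rounding may differ slightly between versions.

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, epochs, save_steps):
    """Estimate optimizer steps and checkpoint steps for a single-GPU run."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoints


steps_per_epoch, total_steps, checkpoints = training_schedule(
    n_examples=3118, per_device_batch=4, grad_accum=4, epochs=3, save_steps=100
)
print(steps_per_epoch, total_steps, checkpoints)
```

Under these assumptions a run has a little under 600 optimizer steps, so `checkpoint-300` sits roughly halfway through training.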
Exporting to Ollama
After training, the adapter is merged into the base model and exported as GGUF (the format Ollama uses):
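```shell
# Merge adapter into base model. Load in 16-bit to limit memory use during the merge,
# and save the tokenizer alongside it, since the GGUF converter reads it from the same directory:
python -c "
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct', torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, 'finetune_output/')
merged = model.merge_and_unload()
merged.save_pretrained('finetune_output/merged/')
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct').save_pretrained('finetune_output/merged/')
"
# Convert to 16-bit GGUF with llama.cpp (the converter's --outtype does not accept q4_K_M):
python llama.cpp/convert_hf_to_gguf.py finetune_output/merged/ \
    --outfile finetune_output/nanda-annotator-f16.gguf --outtype f16
# Quantize to Q4_K_M with llama-quantize (the binary's location depends on how llama.cpp was built):
./llama.cpp/build/bin/llama-quantize finetune_output/nanda-annotator-f16.gguf \
    finetune_output/nanda-annotator.gguf Q4_K_M
# Create Ollama model:
ollama create nanda-annotator -f finetune_output/Modelfile
```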
The finetune_output/Modelfile specifies the GGUF path and custom stop tokens that match the annotation output format:
```
FROM ./nanda-annotator.gguf
PARAMETER stop "<|im_end|>"
PARAMETER stop "</annotation>"
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
```
Once created, nanda-annotator is invokable like any Ollama model:
```shell
python agent.py "/annotate Survey of RAG techniques save to output.md"
# → auto-activates /annotate skill
# → uses nanda-annotator for annotation if available in registry
```
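Outside of agent.py, the model can also be called directly through Ollama's local HTTP API. The sketch below assumes Ollama is listening on its default port 11434; the prompt wording is a placeholder and would need to match whatever template was used during fine-tuning. `build_request` and `annotate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_request(abstract: str, model: str = "nanda-annotator") -> urllib.request.Request:
    """Build a non-streaming generate request for one abstract."""
    payload = {
        "model": model,
        # Placeholder prompt; use the same template as the training data.
        "prompt": f"Annotate the following abstract.\n\nAbstract: {abstract.strip()}\n",
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def annotate(abstract: str) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(abstract), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Temperature, context length, and stop tokens do not need to be repeated in the request, because the Modelfile already bakes them into the model.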