Benchmark study · June 2026

Local LLM Showdown

Which open model should you actually run on a single 32 GB GPU? I put 21 quantized GGUF models through 41 real tasks — coding, agents, RAG, chat, creative, ops and raw speed — all on an AMD R9700 with llama.cpp. Here is the honest scoreboard, down to the actual answers each model gave.

21 models · 41 tasks · 7 categories · 32 GB VRAM target · 4 days of runs

How to read this

A plain-language guide

Every model here runs entirely on your own machine: no cloud, no API bill, no data leaving the building. The trade-off is that a single 32 GB GPU can only hold so much, so the real question isn't "what's the best model" but "what's the best model that fits."

Quantization (Q4 / Q6 / Q8)

Compressing a model so it fits in limited memory. A lower number (Q4) is smaller and faster but slightly less precise; a higher number (Q8) is closer to the original but heavier. Most models here are tested at several quant levels so you can see the trade.
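As a back-of-envelope illustration of that trade, weight size is roughly parameter count × bits per weight ÷ 8. The sketch below uses the nominal bits per weight of three llama.cpp block formats (Q4_0, Q6_K, Q8_0). Real GGUF files mix tensor types and carry metadata, so treat the output as an estimate. The 32B parameter count is a hypothetical example, not a model from this study.

```python
# Nominal bits per weight for three llama.cpp block formats (scales included).
# Real GGUF files mix tensor types, so actual file sizes differ somewhat.
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q6_K": 6.5625, "Q8_0": 8.5}


def weight_gib(n_params: float, quant: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


if __name__ == "__main__":
    n_params = 32e9  # hypothetical 32B-parameter model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{weight_gib(n_params, quant):.1f} GiB")
```

This counts weights only. Context (the KV cache) and runtime buffers come on top, which is why a quant that nominally fits can still sit at the ceiling.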

Context window

How much text the model can "see" at once — prompt plus its own answer. Measured in tokens (~¾ of a word). Bigger context lets it read long documents or whole codebases, but eats more VRAM.
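To see why bigger context eats VRAM, here is a minimal sketch of the usual KV-cache estimate for a standard transformer: one key and one value vector per layer per token. The model shape is hypothetical, and architectures with sliding-window or compressed attention will not follow this formula exactly.

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB: one K and one V vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model shape (fp16 cache), not taken from any run in this study.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for n_ctx in (8_192, 32_768, 131_072):
        print(f"{n_ctx:>7} tokens: ~{kv_cache_gib(n_ctx, **shape):.1f} GiB")
```

With these assumed dimensions each token costs 256 KiB, so 8,192 tokens take 2 GiB and 131,072 tokens would take 32 GiB, the whole card before any weights are loaded.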

VRAM fit

Whether the model fits in the 32 GB GPU and how much room is left. Green = comfortable, yellow = committed, orange = tight, red = at the ceiling.

Tokens / second (speed)

How fast the model writes its answer. Higher is snappier. Anything above ~30 tok/s feels interactive; below ~15 starts to drag for long replies.
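A small sketch of the arithmetic: speed is generated tokens divided by generation time, and the two rough thresholds above split it into bands. The label for the middle band ("usable") is my own placeholder, not a term from the study.

```python
def tokens_per_second(n_generated: int, seconds: float) -> float:
    """Generation speed: tokens produced divided by wall-clock seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_generated / seconds


def feel(tok_s: float) -> str:
    """Map a speed onto the rough bands described above (~30 and ~15 tok/s)."""
    if tok_s > 30:
        return "interactive"
    if tok_s < 15:
        return "drags on long replies"
    return "usable"  # placeholder label for the band in between


if __name__ == "__main__":
    speed = tokens_per_second(400, 10.0)
    print(f"{speed:.0f} tok/s feels {feel(speed)}")
```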

MTP / speculative decoding

A speed trick: a tiny "draft" predicts the next few tokens and the big model checks them in one pass. Acceptance % is how often the guesses were right — higher means a bigger free speed-up.
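To give a feel for how acceptance turns into speed, here is a sketch of the standard idealized estimate: if each drafted token is accepted independently with probability alpha and the draft proposes k tokens, the big model emits (1 − alpha^(k+1)) / (1 − alpha) tokens per verification pass on average. This assumes independent acceptances and ignores the cost of drafting. The acceptance % reported by the harness may be measured differently.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Idealized expected tokens emitted per verification pass.

    alpha: per-token draft acceptance probability (assumed independent).
    k:     number of tokens the draft proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every guess accepted, plus the model's own token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        print(f"acceptance {alpha:.0%}, 3 drafted tokens: "
              f"~{expected_tokens_per_pass(alpha, 3):.2f} tokens per pass")
```

Under these assumptions, 80% acceptance with a 3-token draft gives about 2.95 tokens per pass, against exactly 1 without speculation.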

Pass-rate vs quality score

Pass-rate is the share of the 41 tasks the model actually handled well. Quality score (0–1) is a finer grade from automated validators plus a rubric. A model can pass yet score lower if the answer was correct but sloppy.
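The difference between the two numbers can be sketched as follows. The `TaskResult` record and the plain mean are assumptions for illustration, and the harness may aggregate differently. The sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    task_id: str    # hypothetical identifier
    passed: bool    # did the model handle the task well?
    quality: float  # finer grade in the range 0..1


def summarize(results):
    """Return (pass_rate, mean_quality) for a list of TaskResult."""
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate, mean(r.quality for r in results)


if __name__ == "__main__":
    # Made-up results: the first task passes but with a sloppy answer.
    results = [
        TaskResult("task-a", True, 0.6),
        TaskResult("task-b", True, 1.0),
        TaskResult("task-c", False, 0.2),
    ]
    pass_rate, quality = summarize(results)
    print(f"pass-rate {pass_rate:.0%}, quality {quality:.2f}")
```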

Harness gate vs model pass

The harness gate is a strict automated check (valid JSON, exact phrase, length floor). The model pass asks the softer question: was the answer genuinely good? Both are reported so you can see where a model was right but tripped a technicality.
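A minimal sketch of what a strict gate of this kind could look like, covering the three checks named above. The function name and parameters are hypothetical and are not the AI Flight Recorder API.

```python
import json


def harness_gate(answer, *, require_json=False, must_contain=None, min_chars=0):
    """Run strict machine checks; return (passed, reasons_for_failure)."""
    reasons = []
    if require_json:
        try:
            json.loads(answer)
        except json.JSONDecodeError:
            reasons.append("invalid JSON")
    if must_contain is not None and must_contain not in answer:
        reasons.append("missing exact phrase")
    if len(answer) < min_chars:
        reasons.append("below length floor")
    return (not reasons, reasons)


if __name__ == "__main__":
    # A useful answer that still trips the gate: prose wrapped around the JSON.
    print(harness_gate('Sure! {"risk": "high"}', require_json=True))
    print(harness_gate('{"risk": "high"}', require_json=True))
```

The first call fails on a technicality even though the payload is correct. This is the kind of case where the harness gate and the model pass disagree.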

The 7 task categories

Coding (repair, multi-file fixes, review), Agentic (tool use, injection resistance), RAG (cited answers from documents), Chat (helpful everyday replies), Creative (constrained writing), Ops (command-risk, log triage) and Perf (latency & fit probes).

Methodology

How every answer was scored

Each model is driven through the same 41-task suite by the AI Flight Recorder harness, which talks to a local llama.cpp server over an OpenAI-compatible API and records both the answer and a stream of telemetry. Answers pass through a multi-stage gate before earning a score.

flowchart LR
  subgraph H["AI Flight Recorder harness"]
    direction TB
    TP["41-task suite<br/>7 categories"] --> PX["OpenAI-compatible<br/>proxy"]
  end
  PX -->|"HTTP"| LS["llama.cpp server<br/>Vulkan / ROCm"]
  LS <--> GPU["AMD R9700<br/>32 GB VRAM"]
  LS -->|"tokens + telemetry"| G1
  subgraph S["Multi-stage scoring"]
    direction TB
    G1["Validity gate<br/>HTTP 200 · non-empty"] --> G2["Task validators<br/>unit tests · JSON · citations"]
    G2 --> G3["Rubric + local judge<br/>style · helpfulness"]
    G3 --> SC["Score 0–1<br/>pass / review / fail"]
  end
  SC --> RP["Per-run report<br/>+ VRAM / speed / MTP"]

One prompt at a time, the harness calls the local model, captures the streamed answer plus live VRAM and speed telemetry, then runs the answer through three gates before it earns a score.
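The three gates can be sketched as a short pipeline. Everything specific here is an assumption for illustration: the equal blend of validator and rubric scores, the 0.8 and 0.5 cut-offs for pass / review / fail, and the function signature. None of it is taken from the harness itself.

```python
def score_answer(status, text, validators, rubric, pass_at=0.8, review_at=0.5):
    """Run one answer through three gates; return (score, verdict).

    validators: callables taking the answer text and returning True/False.
    rubric:     callable taking the answer text and returning a 0..1 grade.
    pass_at / review_at: hypothetical verdict thresholds.
    """
    # Gate 1: validity (HTTP 200 and a non-empty answer).
    if status != 200 or not text.strip():
        return 0.0, "fail"

    # Gate 2: task validators; use the fraction that pass.
    checks = [bool(v(text)) for v in validators]
    validator_score = sum(checks) / len(checks) if checks else 1.0

    # Gate 3: rubric or local judge, graded 0..1.
    rubric_score = rubric(text)

    # Hypothetical equal-weight blend of the two later stages.
    score = 0.5 * validator_score + 0.5 * rubric_score
    if score >= pass_at:
        return score, "pass"
    if score >= review_at:
        return score, "review"
    return score, "fail"


if __name__ == "__main__":
    has_citation = lambda t: "[1]" in t  # toy validator
    judge = lambda t: 0.7                # stand-in for a rubric or local judge
    print(score_answer(200, "Paris is the capital [1].", [has_citation], judge))
```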

Composite score weighting

When the results are rolled into a single decision, this is how the five factors are weighted.

What "pass" actually means

A model can clear the strict harness gate (valid JSON, exact phrasing, length floor) yet still be a weak answer — or trip the gate on a technicality while being genuinely useful. So every run reports two numbers: the harness pass-rate (machine-strict) and the model pass-rate (was it actually good?). The leaderboard ranks on the latter, because that's what you feel when you use it.

Full test plan ↗ · Scoring methodology ↗ · Earlier R25 Qwen report ↗

The scoreboard

Leaderboard

All 21 decision-quality runs, ranked by how well the model actually did. Click any column to sort, filter by family, or search. Click a row for the full per-task report and the model's real answers.

#	Model	Quality	Speed	VRAM	MTP	Verdict

Visual comparisons

Find your sweet spot

The charts respond to the family filter above. The headline question — quality versus speed, and what it costs in memory — is the scatter below.

Quality vs. speed

Up and to the right is better: higher pass-rate, faster generation. Bubble size is peak VRAM, colour is model family. Click a bubble to open its report.
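One way to read "up and to the right" programmatically is to keep only the runs that no other run beats on both axes at once (the Pareto front). The sketch below uses made-up model names and numbers, not results from this study.

```python
def pareto_front(runs):
    """Return names of runs not dominated on (pass_rate, tok_s).

    runs: list of (name, pass_rate, tok_s) tuples.
    A run is dominated if another run is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, quality, speed in runs:
        dominated = any(
            q2 >= quality and s2 >= speed and (q2 > quality or s2 > speed)
            for _, q2, s2 in runs
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Hypothetical runs: (name, pass-rate, tokens per second).
    runs = [("model-a", 0.90, 20.0), ("model-b", 0.80, 40.0), ("model-c", 0.70, 30.0)]
    print(pareto_front(runs))
```

Here `model-c` drops out because `model-b` is both faster and more accurate, while `model-a` and `model-b` remain as a genuine quality-versus-speed choice.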

Category strengths

Pass-rate per task category for the top models in view — where each one shines or struggles.

Speculative decoding payoff

Generation speed of the MTP-enabled runs, labelled with their draft-acceptance rate — the free speed-up.

Memory footprint vs. the 32 GB ceiling

Peak VRAM each run reached. The bar fills toward the hard 32 GB limit of the R9700; longer bars are closer to running out of memory.

Where models stumble

Failure types per run. Darker = more of that failure; empty rows are the clean runs.

The payoff

Best pick by use case

If you just want an answer, here it is — the standout for each job, derived straight from the category scores. Numbers, not vibes.

Transparency

All runs & archive

Beyond the 21 decision-quality runs above, the campaign included quick smoke tests, preflight probes and shakedown runs used to dial in settings. They're kept here for completeness.

Run	Kind	Tasks	Avg score	Tok/s	MTP	VRAM