Back to Projects

Model Testing

A practical benchmark journal for local GGUF model testing across Intel B60 and AMD AI Pro R9700 experiments, with llama.cpp, ROCm/Vulkan backend work, long-context probes, MTP speculative decoding, and agent-oriented model profiles.

Evaluating Vulkan vs ROCm compute paths, hardware scaling metrics, speculative decoding (MTP) draft acceptance loops, reasoning budgets, and agent tool-calling parsing failures on Intel and AMD GPUs.

llama.cpp GGUF Intel B60 AMD R9700 ROCm Vulkan MTP
GGUF Registry Qwen3.6-35B-IQ4.gguf ctx: 262k / spec: MTP Qwen3.6-27B-Q5_K.gguf ctx: 192k / spec: MTP Gemma-4-26B-Q4_K.gguf ctx: 262k / spec: None Devstral-24B-Q5_K.gguf ctx: 160k / spec: None Qwen3.5-9B-Q4_K.gguf ctx: 128k / SSH-Agent MODEL REPOSITORY LLAMA.CPP CORE MTP SPECULATIVE LOOP decoding: 35B / draft: 3B Vulkan ACTIVE ROCm STANDBY COMPUTE BACKENDS llama.cpp Metrics Gen Speed: 103.0 TPS MTP Accept: 86.7% Prompt Speed: 953.8 TPS Context Window: 262,144 token_38290: "def" MTP [accept] code blocks token_04921: " parse_duration" MTP [accept] tokens thinking... [budget: 8192] eval time: 71.7s | ok vulkan render: queue flush METRICS ENGINE
Active benchmark journal

Data Availability: Hardware state, model registries, current model/backend/context, and live llama.cpp counters are from live dashboard samples. The Intel B60 small coding benchmark was run directly with recorded results. Historical result files from the script-based test harnesses (/tmp/*.tsv) were not present in their expected locations — see the notes in the B60 Test Matrix section.

Main Takeaways

Field notes from several weeks of testing llama.cpp, GGUF models, MTP/speculative decoding, long-context agent workloads, Vulkan vs ROCm backends, and tool-using clients on the AMD Radeon AI PRO R9700. This is not a clean lab benchmark — it is real agent sessions, long prompts, model switches, failed experiments, power draw surprises, runaway reasoning loops, and enough token burn to start seeing patterns.

MTP is real and worth testing. On the strongest Qwen MTP models, draft acceptance often landed in the low-to-high 80% range. That translated into very strong generation speeds for a 32 GB local GPU.

The backend matters, but model/config matters more. Vulkan and ROCm can both run the same GGUF files. No model conversion is needed to compare them. In practice, Vulkan became the more comfortable default for much of this testing, but every llama.cpp update and model family can change the balance.

Long context is practical, but it is not free. The R9700 can run large-context profiles, including 192k and 262k context, if the model and KV cache are chosen carefully. KV cache format matters — some tests kept both f16 KV and q8 KV variants because q8 KV allowed larger usable contexts.

Reasoning models need budgets. llama.cpp defaults to unrestricted reasoning when reasoning is enabled and no budget is provided. In logs this appeared as reasoning-budget: activated, budget=2147483647 tokens. Combined with clients requesting 16k to 32k output tokens, this caused long thinking loops.

Tool clients are fragile with local models. Cline/OpenCode do not just need a smart model. They need a model that produces parseable tool/action output. Some models were fast and smart but drifted into malformed tool calls, reasoning-only responses, or repeated self-analysis.

Not all "Qwen" models are multimodal. The Qwen3.6 text/MTP models tested here do not support images. Qwen-VL models do, but they require separate model/projector setup.

Live Hardware Snapshot

Two actual machines with telemetry from their ServerTop dashboards at sample time.

Host Hardware Memory GPU Live llama.cpp State Notes
intelgpullm Intel Core i5-11600K, 12 logical CPUs 48 GB RAM, 8 GB swap Intel Arc Pro B60 / Battlemage G21, 24 GB VRAM gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf, Vulkan, 32k context, healthy ServerTop reports the B60 as available through Vulkan with /dev/dri and /dev/dri/renderD128; current counters were idle at sample time.
jkxeon Dual Intel Xeon E5-2690 v3, 48 logical CPUs 64 GB RAM, 8 GB swap AMD Radeon AI PRO R9700, ~32 GB VRAM (reported ~31.9 GiB usable) Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf, Vulkan, 262k context, healthy Live sample showed 103,574 generated tokens, 650,802 prompt tokens, recent generation around 39.5–45.3 TPS, recent prompt around 383–600 TPS, and MTP acceptance averaging about 65.8%.

Storage Context

Host Mount Total Free (live sample)
Intel B60 /models shares root NVMe volume ~466 GB ~383 GB
AMD R9700 /models is RAID10 ~1.0 TB ~663 GB

/backup is RAID1 and /archive is a larger HDD volume on the R9700 host.

Companion Hardware Note

Intel Arc Pro B60 was evaluated and tested with llama.cpp Vulkan/MTP. See the Intel Arc Pro B60 Follow-Up Test section for full results. The B60 showed that Intel Arc can work with llama.cpp in Docker, but PCIe/runtime power management stability must be addressed first (GRUB drop-in for DMC/runtime PM). 24 GB class VRAM with both SYCL and Vulkan paths plausible. Driver maturity is improving. Best use: secondary agent server.

Model Registry Highlights

Representative GGUF profiles across both platforms, including MTP-curated and long-context candidates.

Profile Model Family Context KV/Cache Special Mode Notes
Qwen3.6 35B A3B Opus 4.7 Distilled APEX MTP Qwen3.6 35B A3B 262k default draft-mtp Curated MTP profile; ~100.46 TPS, ~84% acceptance
Qwopus3.6 35B A3B APEX MTP Qwen/Qwopus 35B A3B 262k default draft-mtp ~103 TPS, fastest 35B MTP; hit 4096-token output cap
Qwen3.6 27B Dense MTP Q5_K_M Qwen3.6 27B Dense 192k default draft-mtp 45.54 TPS, 88.995% acceptance; completed coding test cleanly
Qwen3.6 35B A3B MTP Q4 Qwen3.6 35B A3B 262k default draft-mtp MTP MoE profile; ~70.5 TPS in live workload
Qwen3.6 27B Dense Q4 Qwen3.6 27B Dense 192k default none Non-MTP comparison
Qwen3.6 35B MoE Q4 Qwen3.6 35B A3B 262k default none Long-context profile
Gemma 4 26B A4B Q4 Gemma 262k default none ~42–44 TPS; caused more client parser/tool-output issues in Cline-style work
GLM-4.7 Flash Q4 GLM 203k default none Long-context profile; non-Qwen option
Gemma 4 31B Dense Q4 Gemma 160k default none Dense comparison; loaded/tested
Qwen3-Coder 30B A3B Q4 Qwen3-Coder 131k to 256k f16/q8 none Coding and context ladder; q8 KV enables 256k
Devstral Small 2 24B Q5 Devstral 96k to 160k f16/q8 none Agent/coding candidate; non-Qwen alternative
Qwen3.5 4B Q4 Qwen3.5 64k / 128k default none SSH agent small model
Qwen3.5 9B Q4 Qwen3.5 64k / 128k default none SSH agent small model

Intel Arc Pro B60 Follow-Up Test

A later follow-up run tested the same general llama.cpp/Vulkan/MTP workflow on an Intel Arc Pro B60 server — a separate box from the R9700 system. It exposed a different class of problem: not model fit or raw throughput first, but PCIe/runtime power-management stability.

B60 Test System

Component Notes
GPUIntel Arc Pro B60 / Battlemage G21
VRAM24 GB class; llama.cpp reported 24,480 MiB total, ~21,995 MiB free after startup
CPUIntel Core i5-11600K class host
Host RAM~46 GiB available
KernelUbuntu kernel 7.0.0-22-generic during the test
Runtimellama.cpp server in Docker
BackendVulkan
llama.cpp imageghcr.io/ggml-org/llama.cpp:server-vulkan, build 9538 / 5343f4502
Serving flags--metrics, --cache-ram 0, --parallel 1, --no-warmup, -ngl 99
Small coding-test context4096

The first benchmark attempt made the server look idle or broken because Vulkan had fallen back to CPU/llvmpipe. The llama.cpp container logs showed:

llama.cpp Vulkan fallback error
MESA: error: Unknown kernel mode driver
warning: no usable GPU found, --gpu-layers option will be ignored

The kernel was also logging xe driver force-wake and PCIe power-state errors. The GPU and upstream PCI bridges were stuck in D3cold, and soft PCI resets failed with I/O errors. A warm reboot was not enough; after the first reboot the GPU briefly appeared and then runtime-suspended back into an unusable state.

The durable workaround was to disable the runtime power-management path for this card/driver combination with a GRUB drop-in:

GRUB cmdline drop-in
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT pcie_port_pm=off xe.dmc_firmware_path=/dev/null xe.enable_dc=0 xe.disable_power_well=0"

After regenerating GRUB and rebooting, the card stayed in D0, /dev/dri remained present, and vulkaninfo reported the real discrete GPU:

vulkaninfo output after fix
deviceName = Intel(R) Arc(tm) Pro B60 Graphics (BMG G21)
driverName = Intel open-source Mesa driver

llama.cpp inside Docker then reported:

llama.cpp Vulkan device detection
Vulkan0 : Intel(R) Arc(tm) Pro B60 Graphics (BMG G21) (24480 MiB, 21995 MiB free)

B60 Small Coding Benchmark

The B60 benchmark used a compact coding prompt:

Coding challenge prompt
Write a Python function parse_duration(text) that converts strings like "2h 30m", "45s", and "1d 2h" to seconds. Include exactly three assert tests. Return only code.
Setting Value
Context4096
BackendVulkan
GPU offload-ngl 99
Parallelism--parallel 1
Prompt cache--cache-ram 0
Generation cap256 tokens
SamplingTemperature 0.2, top-p 0.9
MTP settingQwen models compared with and without --spec-type draft-mtp --spec-draft-n-max 2

Results

Model Context MTP Load Time Prompt TPS Gen TPS Wall Time Status
stories15M q4_04096No1 s93.40434.670.72 sOK
Qwen3.6 35B A3B IQ4_XS4096No16 s24.3624.5312.33 sOK
Qwen3.6 35B A3B IQ4_XS4096draft-mtp9 s16.6229.6411.41 sOK
Gemma 4 26B A4B QAT Q4_K_XL4096No17 s23.4157.146.58 sOK
Qwen3.6 27B IQ4_XS4096No16 s13.736.7041.57 sOK
Qwen3.6 27B IQ4_XS4096draft-mtp9 s12.179.9929.42 sOK
Gemma 4 31B IQ4_XS4096No19 s13.506.0530.59 sOK

MTP Effects on the B60

Model Non-MTP Gen TPS MTP Gen TPS Gen TPS Gain Non-MTP Wall MTP Wall Wall-Time Change
Qwen3.6 35B A3B IQ4_XS24.5329.64+20.8%12.33 s11.41 s7.5% faster
Qwen3.6 27B IQ4_XS6.709.99+49.1%41.57 s29.42 s29.2% faster

The B60 result is not directly comparable to the R9700 long-context runs because this was a 4096-context smoke-style coding benchmark after a driver recovery. Still, it gives a useful practical read: MTP helped both Qwen models, and it helped the 27B dense model much more than the 35B A3B MoE in this specific short-output test.

B60 Lessons

Intel Arc can work with llama.cpp Vulkan in Docker, but the host GPU power state has to be healthy first. The failure mode looked like a model/load problem at first, but it was really a host PCIe/runtime-PM problem. When Vulkan only saw llvmpipe, large models either failed to load or were killed while falling back toward CPU behavior.

For this B60 box, disabling DMC/runtime PM was the key stability fix. xe.dmc_firmware_path=/dev/null caused the kernel to log Disabling DMC firmware and runtime PM, and after that the card stayed in D0. The additional pcie_port_pm=off, xe.enable_dc=0, and xe.disable_power_well=0 settings kept the surrounding PCIe/display power-saving behavior conservative.

The B60 had less memory headroom than the R9700. llama.cpp reported about 24.5 GiB total Vulkan memory and about 22.0 GiB free at startup. That is enough for useful 27B/35B GGUF testing at small context, but it is a different envelope from the 32 GB R9700 long-context profiles.

Gemma 4 26B was the fastest large model in this small B60 coding test. It reached 57.14 gen TPS, but the excerpt was low-quality/repetitive for the coding prompt. As with the R9700 notes, raw throughput alone is not enough for agent usefulness.

MTP remained worthwhile on Intel Vulkan. The Qwen35 A3B MTP run gained about 20.8% generation throughput, while Qwen27 gained about 49.1%. Even without retained draft-acceptance logs for this B60 run, the end-to-end wall-time improvement was clear.

AMD R9700 Test Matrix and Live Results

Highest-confidence measured values currently available from live ServerTop metrics on the jkxeon host.

Live R9700 llama.cpp Sample

Source: ServerTop dashboard http://192.168.1.116:8090/api/metrics and /api/models

Field Value
Hostjkxeon
GPUAMD Radeon AI PRO R9700, ~32 GB VRAM
Active backendVulkan
Active modelQwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf
Active context262,144 tokens
HealthHealthy
Generated tokens total103,574
Prompt tokens total650,802
Recent generation TPS43.98, 39.53, 44.33, 45.31, 43.63
Recent prompt TPS383.17, 480.69, 423.37, 519.30, 600.01
Dashboard average generation TPS45.4143
Dashboard average prompt TPS953.836
MTP acceptance latest68.75%
MTP acceptance average65.79%
MTP accepted/generated counters76,945 accepted / 85,918 generated

Installed R9700 Registry Highlights

The R9700 dashboard reported the curated registry as installed for the main long-context and MTP profiles.

Profile Context Special Mode Installed Active
Qwen3.6 35B A3B Opus 4.7 Distilled APEX MTP I-Balanced 262k draft-mtp yes yes
Qwopus3.6 35B A3B v1 APEX MTP I-Balanced 262k draft-mtp yes no
Qwen3.6 27B Dense MTP Q5_K_M 192k draft-mtp yes no
Qwen3.6 27B Dense MTP IMAT IQ4_XS + Q8 NextN 192k draft-mtp yes no
Qwen3.6 35B A3B MTP Q4 262k draft-mtp yes no
Qwen3.6 35B A3B MTP IMAT Q4_K_M + Q8 NextN 262k draft-mtp yes no
Qwen3.6 35B A3B APEX MTP I-Balanced 262k draft-mtp yes no
Qwen3.6 27B Dense Q4 192k none yes no
Qwen3.6 35B MoE Q4 262k none yes no
Gemma 4 26B A4B Q4 262k none yes no
GLM-4.7 Flash Q4 203k none yes no
Gemma 4 31B Dense Q4 160k none yes no
Qwen3-Coder 30B A3B Q4 131k / 256k q8 KV on 256k profile yes no
Devstral Small 2 24B Q5 96k / 160k q8 KV on 160k profile yes no
Qwen3.5 4B / 9B agent profiles 64k / 128k none yes no

Test Methodology

Approaches and scripts used across both platforms.

Serving Flags

Most important flags for successful MTP runs:

--metrics --cache-ram 0 -c <context> --parallel 1 -ngl 99 --spec-type draft-mtp --spec-draft-n-max 2 --poll 0 --poll-batch 0 --spec-draft-poll 0 --spec-draft-poll-batch 0

For Qwen reasoning models:

--reasoning-budget 8192

Extremely important. Without it, llama.cpp treats the thinking budget as unrestricted.

Prompt Cache: Disabled with --cache-ram 0. Long agent prompts already consume a lot of memory. The prompt cache could grow large enough to create RAM pressure and odd stalls. The tradeoff is that some repeated prompt processing gets slower, but the system avoids nasty pauses and memory migration behavior.

Context Acceptance Based on context_acceptance.py, context_ladder.py, context_refine.py, and context_refine_qwen36.py. Start llama.cpp container for one model/context combination, wait for health endpoint, run a short completion prompt. Record pass/fail, reported context, timings, and log tail. Refine around likely maximum usable context sizes. AMD output target: /models/gguf/context-acceptance-results.json. B60 output targets: /tmp/llama_ctx_results.tsv and /tmp/llama_ctx_refine.tsv.
Small Agent Model Benchmarks Based on qwen35_context_bench.py. Test Qwen3.5 2B/4B/9B Q4 models at 32k, 64k, and 128k contexts. Use SSH-agent style prompts. Warm up once, then record multiple runs. Record prompt/generation speed and VRAM state. Output target: /models/gguf/qwen35-small-context-bench.json.
Backend and Runtime Updates Based on bench-mtp-request.json, bench-mtp-long-request.json, check_runtime_updates.py, verify-webtop-backends.py, and ServerTop backend switching. Compare ROCm and Vulkan behavior under llama.cpp. Verify active container image and backend state. Use ServerTop Web to switch and inspect live metrics. R9700 live sample currently shows Vulkan active, with ROCm still present as an available backend profile. B60 live sample shows Vulkan available/active through Intel /dev/dri devices.
MTP and Speculative Decoding Based on mtp_context_probe.sh, parse-mtp-acceptance.py, and ServerTop MTP counters. Start MTP-capable profiles with --spec-type draft-mtp and --spec-draft-n-max 2. Sweep contexts from 8k through 262k. Record load state and completion timings. Parse #gen tokens and #acc tokens from logs for acceptance information. Live R9700 sample shows active MTP profile with dashboard history and an average acceptance rate of about 65.8%.

Controlled 27B MTP Coding Tests

Most apples-to-apples controlled test: two Qwen3.6 27B dense MTP models on a coding prompt asking for a thread-safe EventIndex, filters, indexes, complexity notes, and pytest coverage.

Setting Value
Context196,608
BackendVulkan
MTP--spec-type draft-mtp, --spec-draft-n-max 2
SamplingTemperature 0.1
ThinkingDisabled
Max output4096

Results

Model Finish Output Tokens Generation Time Gen TPS MTP Acceptance Notes
froggeric Qwen3.6 27B MTP Q5_K_M Stop 3,651 80.163 s 45.54 88.995% Complete answer; code block closed; tests included.
localweights Qwen3.6 27B MTP IMAT IQ4_XS + Q8 NextN Stop 3,625 71.761 s 50.51 89.034% Complete answer; faster; very similar acceptance.

Conclusion: the IMAT IQ4_XS + Q8 NextN version was the better daily-driver 27B profile — faster while preserving a complete answer. The Q5_K_M version remains a useful quality baseline.

35B MTP and Opus/Qwopus Variants

The 35B A3B MTP family became the most interesting class of models. Small enough in active parameters to run comfortably on the R9700, but large enough to behave more capable than smaller dense models. MTP made them feel much less like a compromise.

Model Gen TPS MTP Acceptance Behavior
Qwen3.6 35B A3B Claude 4.7 Opus Reasoning Distilled APEX MTP ~100.46 ~84.0% Strong coding test, fast, but prone to long internal reasoning if unbounded.
Qwopus3.6 35B A3B v1 APEX MTP ~103.0 ~86.7% Fastest observed 35B result; verbose and hit 4096 output cap.

Above 80% acceptance, speculative output becomes a major quality-of-life improvement. The model no longer feels like a big local compromise — it feels like an interactive coding model.

MTP in Practice

MTP is not just "more tokens per second." The draft model proposes future tokens; the main model accepts or rejects them. If acceptance is high, you get more output per expensive decode step. If acceptance is low, the speculative path can become less useful.

Acceptance Ranges

Acceptance Range Practical Meaning
85–90%Excellent. MTP is very likely helping.
80–85%Good. Still worth using.
70–80%Probably useful, but compare against non-MTP.
Below 70%Needs closer testing; may depend on prompt type.

Observed MTP Acceptance Examples

Model / Run Acceptance
Qwen3.6 27B MTP Q5_K_M coding test88.995%
Qwen3.6 27B MTP IMAT IQ4_XS + Q8 NextN coding test89.034%
Qwen3.6 35B Opus Reasoning Distilled APEX MTP~84.0%
Qwopus3.6 35B A3B v1 APEX MTP~86.7%
Long live Qwen MTP request~82.1%

The best MTP models were both fast and stable. The weaker failure mode was not usually bad MTP acceptance — it was clients giving the reasoning model too much room to think.

Results

Combines controlled small coding tests, longer 4096-token coding tests, live dashboard readings, and real agent-session observations. Treat as practical operating data, not as a standardized leaderboard.

Model / Profile Quant / Format Context Backend MTP Observed Gen TPS MTP Acceptance Notes
Qwen3.6 35B A3B MTP Q4 Q4_K_M GGUF 262k Vulkan / ROCm Yes ~70.5 Not captured in that screenshot Strong early MTP result; good proof that 35B-class MoE can be fast locally.
Qwen3.6 35B A3B MTP IMAT Q4_K_M + Q8 NextN IMAT Q4 + Q8 NextN 262k Vulkan Yes ~50.3 Dashboard tracked separately Lower power draw than expected; ~27.3 GiB VRAM, 153 W in one observed run.
Qwen3.6 35B A3B APEX MTP I-Balanced APEX / I-Balanced 262k Vulkan Yes Tested, good enough to add Not retained Added after coding evaluation; later superseded by Opus/Qwopus variants.
Qwen3.6 35B A3B Claude 4.7 Opus Reasoning Distilled APEX MTP I-Balanced GGUF 262k Vulkan Yes ~100.46 ~84.0% Very fast, but reasoning behavior can loop if not budgeted.
Qwopus3.6 35B A3B v1 APEX MTP I-Balanced GGUF 262k Vulkan Yes ~103.0 ~86.7% Fastest observed 35B result; verbose; hit 4096-token limit in test.
Qwen3.6 27B Dense MTP Q5_K_M Q5_K_M MTP GGUF 192k Vulkan Yes 45.54 88.995% (2337/2626) High-quality dense baseline; output completed cleanly in 4096-token coding test.
Qwen3.6 27B Dense MTP IMAT IQ4_XS + Q8 NextN IQ4_XS + Q8 NextN 192k Vulkan Yes 50.51 89.034% (2322/2608) Best practical 27B dense result: faster than Q5 while completing code tests.
Qwen3.6 27B Dense MTP IMAT IQ4_XS + Q8 NextN (real agent run) IQ4_XS + Q8 NextN 192k Vulkan Yes ~42.5 sustained ~82.1% Long reasoning/tool session decoded 16k+ tokens steadily; not stalled, just too much thinking.
Gemma 4 26B A4B Q4 GGUF 262k Vulkan No ~42–44 N/A Useful model, but caused more client parser/tool-output issues in Cline-style work.
Gemma 4 31B Dense Q4 GGUF 160k Vulkan No Not retained N/A Loaded/tested; user planned to evaluate Gemma behavior separately.
GLM-4.7 Flash Q4_K_M GGUF ~203k Vulkan / ROCm No Not retained N/A Kept as non-Qwen option; less detailed performance retained.
Qwen3-Coder 30B A3B Q4_K_M, f16 KV 131k Vulkan / ROCm No Not retained N/A Kept as KV-cache comparison profile.
Qwen3-Coder 30B A3B Q4_K_M, q8 KV 256k Vulkan / ROCm No Not retained N/A q8 KV enabled larger context profile.
Devstral Small 2 24B Q5_K_M, f16 KV 96k Vulkan / ROCm No Not retained N/A Kept as non-Qwen coding alternative.
Devstral Small 2 24B Q5_K_M, q8 KV 160k Vulkan / ROCm No Not retained N/A Larger context profile retained.

Power, Clocks, and Idle Behavior

The R9700 behaved differently depending on model and backend.

State / Model GPU Usage VRAM Temp Power Clock Notes
Non-MTP idle ~3% ~28.5 GiB 46 C ~23 W Low Normal idle behavior.
MTP idle before tuning 100% ~28.9 GiB 48 C ~83–90 W Elevated Looked busy while idle; likely polling/spin behavior.
IQ4/Q8 NextN active run 100% ~27.3 GiB 66 C ~153 W ~3.21 GHz Lower power than expected for the output rate.
Heavy long generation 100% ~89% allocated 69 C edge / 93 C junction / 88 C memory ~377 W ROCm reported Hot, real workload; not stalled.

The idle 100% issue mattered because a model that is not actively generating should not sit at high clocks and draw needless power. Tuning poll behavior allowed idle to drop back toward low usage/power, at the cost of a slightly slower wake-up.

LACT was tried for overclocking/undervolting but was not useful for increasing memory clock on this setup, so it was removed. The practical tuning path ended up being software/runtime tuning, not GPU overclocking.

Thinking Loops and Client Configuration

The most painful failures were not simple crashes. They were "the model is alive but doing the wrong thing forever."

Qwen Reasoning Loop

One OpenCode trace showed the model repeating variations of the same thought:

OpenCode trace snippet
I see the issue now - the echo lines need to mirror the infinity symbol's actual shape...

It repeated the same geometric analysis for minutes. Root causes:

Setting / Behavior Effect
OpenCode model marked "reasoning": trueClient intentionally triggered reasoning mode.
Output limit around 32768Client allowed huge completions.
llama.cpp default reasoning budget -1Unrestricted thinking.
Server log budget 2147483647Effectively infinite reasoning budget.
No small model configuredUtility calls used the same large reasoning model.

Fixes Applied

Client limit settings payload
{"limit": {"context": 196608, "output": 8192}}
Server daemon thinking budget flag
--reasoning-budget 8192

We also recommended removing small_model when it pointed to the same large reasoning model. A "small model" that is actually the same 35B reasoning model only makes utility calls slower and more expensive.

Cline Invalid API Responses

Cline began showing errors like:

Cline client console error output
Invalid API Response: The provider returned an empty or unparsable response.

Likely causes:

The local model emitted malformed tool calls.

The model produced reasoning-only content that the client did not treat as assistant content.

The response was too long and drifted away from Cline's expected syntax.

The loaded server model and the client-selected model drifted apart.

One observed state had Cline-style requests hitting Gemma 4 26B with:

Parameter Value
Chat formatpeg-gemma4
Max tokens16384
Reasoning formatdeepseek
Streamingtrue
MTPnone

That is a fragile setup for a tool client. Gemma may be useful for normal coding/chat, but the local Qwen MTP models were more promising for Cline/OpenCode tool workflows.

Screenshots

ServerTop llama.cpp panel — Current model, backend, context, token counters, prompt/gen TPS, and MTP acceptance.
Model registry — Curated GGUF profiles with context and backend options.
Backend toggle — ROCm/Vulkan control state before and after a switch.
Benchmark terminal — Context ladder or MTP sweep output, with paths redacted as needed.
Results table — Final benchmark table once result JSON/logs are supplied.
Hardware telemetry — GPU VRAM, power, temperature, and utilization during a run.

Client Settings That Matter

For local tool agents, conservative client settings are not optional. They are stability settings.

Setting Recommendation
Max output4096 to 8192
Reasoning budget4096 to 8192
Temperature0.1 to 0.2 for coding/tool work
StepsKeep moderate; avoid 30-step loops by default
Tool modelPrefer the most tool-stable model, not the most verbose model
Small modelUse a real smaller model or omit it
Vision attachmentsDo not send images to text-only Qwen models

The biggest lesson: local agent stability depends on the contract between the client and the model. A model can be fast, capable, and still bad at the exact syntax a tool client expects.

Recommended Model Roles

Based on the tests and agent workloads observed.

Role Recommended Model Type
Daily local coding agentQwen3.6 27B Dense MTP IMAT IQ4_XS + Q8 NextN
Higher-capability local coding/reasoningQwen3.6 35B A3B Opus/Qwopus MTP with reasoning budget
Fast long-context MoE experimentsQwen3.6 35B A3B MTP Q4/Q8 NextN profiles
Non-Qwen comparisonDevstral Small 2 24B, GLM-4.7 Flash, Gemma 4
Vision/screenshot workSeparate Qwen-VL or other multimodal model, not current Qwen3.6 MTP
Cline/OpenCode tool usePrefer Qwen MTP over Gemma until tool formatting is proven

Practical Recommendations

If rebuilding this from scratch.

1
Start with a known-good llama.cpp build and one backend

Don't try to compare backends on day one. Get one working, then add the other as controlled comparison.

2
Use Vulkan first if it is already stable on the host

In practice, Vulkan became the more comfortable default for much of this testing.

3
Add ROCm only as a controlled comparison

Same model file, same context, same MTP flags — switch only the runtime image/backend.

4
Use GGUF Q4/IQ4/Q5 models that fit with the target context

Quantization matters. Keep f16 and q8 KV variants where larger contexts are needed.

5
Add MTP only when a model provides matching MTP/NextN support

MTP without a matching draft model wastes tokens and can hurt performance.

6
Track MTP acceptance over time

A point-in-time acceptance rate can look good while a long agent session later falls into a very different token distribution.

7
Cap reasoning budget immediately

--reasoning-budget 8192 prevents unlimited thinking loops. Without it, the default budget is effectively infinite.

8
Cap client output tokens immediately

4096 to 8192 is a practical range. Clients requesting 16k to 32k output tokens cause runaway loops.

9
Keep one non-reasoning model available for tool clients

Tool clients need parseable output, not deep internal reasoning.

10
Do not send images to text-only models

The Qwen3.6 MTP and reasoning models tested here are text-only. Qwen-VL models exist but require separate setup.

Suggested Default llama.cpp Posture for Qwen MTP

Recommended llama.cpp server flags
--cache-ram 0 -c 196608 --parallel 1 -ngl 99 --spec-type draft-mtp --spec-draft-n-max 2 --reasoning-budget 8192

Suggested Default Client Posture

Client parameters configuration
{"temperature": 0.1, "limit": {"output": 8192}}

For more fragile clients like Cline, 4096 output tokens may be a better starting point.

Practical Conclusions

Long-context GGUF testing benefits from a repeatable model switcher because Docker command variation becomes error-prone.

Backend choice matters enough to expose ROCm/Vulkan switching in the dashboard rather than bury it in scripts.

Prompt cache behavior can become a system-level issue during long agent prompts, so ServerTop launches with --cache-ram 0.

MTP testing needs acceptance metrics, not just tokens per second.

Small Qwen3.5 agent profiles were tested as practical SSH-agent candidates at 64k and 128k contexts.

Reasoning models need budgets. Without --reasoning-budget, llama.cpp defaults to effectively infinite thinking, causing loops that waste tokens and time.

The core lesson: local models are not just "download model, run server." The model, quant, backend, context, MTP settings, reasoning budget, and client parser all form one system. When they line up, a 32 GB local GPU can deliver surprisingly strong results. When they do not, it can spend ten minutes thinking about an SVG curve and never actually fix the file.

Current Status and Next Steps

Current

  • Test harnesses and dashboard controls exist.
  • Curated model registry exists.
  • Multiple context ladder and MTP probe scripts exist.
  • Live telemetry and controlled coding tests captured on R9700.
  • MTP acceptance tracking and poll tuning documented.
  • Intel B60 follow-up coding benchmark completed with Vulkan/MTP results.
  • B60 PCIe/runtime PM stability fix documented (GRUB drop-in).