Gemma 4 vs Qwen3.5: benchmarking quantized local LLMs on Go coding
April 2026
Part 3 of 3. Previously: Part 1, Part 2.
In episode 1 three models tied at 13/15. The test was too easy — it couldn’t separate a good model from a mediocre one having a lucky run. Since then Qwen3.5, Gemma 4, and Qwen3-Coder dropped. They’d all tie too. We needed a harder exam and better methodology. We also suspected (correctly) that single-seed results were noise and that our grep-based scoring was garbage, so we planned for multi-seed and real test execution from the start.
Same hardware: Framework 13, Ryzen AI 370HX, Radeon 890M iGPU, 64 GB DDR5, Vulkan. llama.cpp b8708, models served through llama-swap. All runs: 3 seeds (42, 123, 456), temp 1.0, 16k context, 10 min timeout. --reasoning off for Qwen3.5 (without it, the model burns its entire context on chain-of-thought before writing any code). Started with 15 models, pruned to 11 after the first round — DeepSeek-Coder-V2-Lite, DeepSeek-R1-14B, GLM-4.7-Flash, GLM-4.7-Flash-REAP, and gemma-3n-E4B scored consistently badly.
The two exams
Exam v1 (/15): Factorial, word frequency counter, file tree walker. 5 points each: builds(1) + runs(1) + correct output(3). Easy. A reliability gate — most decent models hit 14/15 every seed.
Exam v2 (/10): Modify a 208-line Go metrics scraper to add resilience — buffer metrics in memory during network outages, randomly evict when the buffer is full, flush via a background goroutine on reconnect. This is the one that separates models.
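To make the task concrete, here is a minimal sketch of the pattern the exam asks for: a mutex-guarded bounded buffer with random eviction and a drain path for the reconnect goroutine. All names here (`Metric`, `Buffer`, `Add`, `Drain`) are invented for this illustration, not the exam's reference solution.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// Metric stands in for whatever the scraper collects.
type Metric struct {
	Name  string
	Value float64
}

// Buffer holds metrics while the sink is unreachable.
type Buffer struct {
	mu   sync.Mutex
	max  int
	data []Metric
}

// Add buffers a metric, randomly evicting an existing entry when full.
// The mutex matters: the scrape loop and the flush goroutine both touch data.
func (b *Buffer) Add(m Metric) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.max <= 0 {
		return // -buffer-size 0: drop instead of calling rand.Intn(0)
	}
	if len(b.data) >= b.max {
		// Overwrite a random slot so survivors are a mix of old and new.
		b.data[rand.Intn(len(b.data))] = m
		return
	}
	b.data = append(b.data, m)
}

// Drain returns and clears the buffer; the background goroutine calls
// this on reconnect and pushes the result to the sink.
func (b *Buffer) Drain() []Metric {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.data
	b.data = nil
	return out
}

func main() {
	b := &Buffer{max: 2}
	for _, m := range []Metric{{"up", 1}, {"up", 1}, {"up", 1}} {
		b.Add(m)
	}
	fmt.Println(len(b.Drain())) // bounded at max: prints 2
}
```

The locking is not optional: the scraper and the flush goroutine share the buffer, and a `-race` build will catch any path that skips the mutex.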
How the eval works
The first exam_v2 evaluator used grep. It checked whether the code contained sync.Mutex, rand.Intn, go func. Models scored 15–18/20. That was broken — a model can write sync.Mutex and still have three data races.
We replaced it with Go integration tests. The harness compiles the model’s code into a binary, runs it against a mock server with controllable online/offline state, and executes 10 tests:
| Test | What it checks |
|---|---|
| OnlineFlow | Metrics reach the sink when online |
| BuffersDuringOutage | Nothing leaks when offline |
| FlushOnReconnect | Buffered metrics flush on reconnect |
| BufferBounded | Flushed count stays within buffer-size |
| EvictionRandom | Evicted items are a mix of old and new |
| MultipleOutageCycles | Survives 3 offline/online transitions |
| BufferSizeZero | Doesn’t panic with -buffer-size 0 |
| BufferSizeOne | Works with -buffer-size 1 |
| GracefulShutdown | Exits cleanly on SIGINT |
| RaceDetector | Compiled with -race, no data races |
The harness auto-detects flag names from the binary’s -h output, so it adapts to whatever CLI the model invents. The same models that scored 15–18/20 with grep scored 0–8/10 under real execution.
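The auto-detection step can be sketched as a regex scan over the binary's help text. The regex, the canned help output, and the `detectBufferFlag` name are all assumptions for illustration; the real harness may parse `-h` output differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// detectBufferFlag scans a binary's -h output for the buffer-size flag,
// whatever the model decided to call it (-buffer-size, -buf-max, ...).
func detectBufferFlag(help string) string {
	re := regexp.MustCompile(`-[\w-]*buf[\w-]*`)
	return re.FindString(help) // empty string if no buffer flag found
}

func main() {
	// Canned help text standing in for the output of `./scraper -h`.
	help := "Usage of scraper:\n" +
		"  -buf-max int\n    \tmax buffered metrics\n" +
		"  -sink string\n    \tsink URL"
	fmt.Println(detectBufferFlag(help)) // prints -buf-max
}
```

The same idea extends to the sink URL and interval flags: match on a stable substring of the flag's likely name rather than a fixed spelling, so one harness covers every CLI the models invent.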
Results
Exam v1: Three Go programs (/15)
| Model | Mean | Range | Tok/s |
|---|---|---|---|
| Qwen3.5-35B Q6_K | 14.3 | 14–15 | 22.1 |
| Gemma 4 26B Q4_K_M | 14.0 | 14–14 | 18.6 |
| Gemma 4 26B MXFP4 | 14.0 | 14–14 | 18.8 |
| Gemma 4 26B Q5_K_M | 14.0 | 14–14 | 17.0 |
| Qwen3.5-35B Q4_K_M | 14.0 | 14–14 | 22.1 |
| Qwen3.5-35B MXFP4 | 14.0 | 14–14 | 21.9 |
| Qwen3-Coder-30B + draft | 14.0 | 14–14 | 25.9 |
| gpt-oss-20b | 14.0 | 14–14 | 27.0 |
| Gemma 4 E4B Q8 | 12.7 | 10–14 | 13.8 |
| Qwen3.5-9B | 12.3 | 9–14 | 13.6 |
| Qwen3.5-35B Q5_K_M | 12.0 | 9–14 | 21.0 |
The range column is what matters here: 14–14 means the model hit 14/15 on all three seeds, while a spread like 9–14 means it rolls dice. It tells you which models are dependable.
Previously tested (dropped after first round):
| Model | Mean | Range | Notes |
|---|---|---|---|
| DeepSeek-Coder-V2-Lite 16B Q8 | 10.3 | 9–13 | Legacy coder model, outclassed |
| GLM-4.7-Flash 30B Q4_K_M | 10.0 | 9–12 | Dense 30B, too slow for the score |
| gemma-3n-E4B Q8 | 8.0 | 5–10 | Previous gen Gemma, inconsistent |
| GLM-4.7-Flash-REAP 23B Q4_K_M | 7.0 | 4–10 | MoE variant, worse than dense |
| DeepSeek-R1-14B Q4_K_M | 5.0 | 5–5 | Reasoning model with --reasoning off — barely functional |
These models consistently scored below the new contenders and were dropped from exam v2 and the quantization sweep. Qwen3-8B (the episode 1 champion at 4.7 GB) was superseded by Qwen3.5-9B — same family, newer weights.
Exam v2: Resilience modification (/10)
| Model | Mean | Compiles | Tok/s |
|---|---|---|---|
| Gemma 4 26B Q4_K_M | 4.0 | 2/3 | 18.6 |
| Gemma 4 26B MXFP4 | 4.0 | 2/3 | 18.8 |
| Gemma 4 26B Q5_K_M | 4.0 | 2/3 | 17.0 |
| Qwen3-Coder-30B + draft | 3.7 | 2/3 | 25.9 |
| gpt-oss-20b | 3.7 | 2/3 | 27.0 |
| Gemma 4 E4B Q8 | 3.7 | 2/3 | 13.8 |
| Qwen3.5-35B Q5_K_M | 3.3 | 2/3 | 21.0 |
| Qwen3.5-35B Q6_K | 2.7 | 1/3 | 22.1 |
| Qwen3.5-35B Q4_K_M | 2.3 | 1/3 | 22.1 |
| Qwen3.5-9B | 2.3 | 1/3 | 13.6 |
| Qwen3.5-35B MXFP4 | 2.0 | 1/3 | 21.9 |
“Compiles” = the number of seeds (out of 3) where the model produced code that builds and passes at least one test. Every model fails at least one seed. When they do compile and pass, they typically score 5–7/10, but the failed seeds drag the mean down; a mean of 4.0/10 is the ceiling so far. Common failures: rand.Intn(0) panic with -buffer-size 0, off-by-one errors in the flush logic, and concurrent buffer access without locking.
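The rand.Intn(0) failure deserves a note: Go's `rand.Intn` panics whenever its argument is not positive, so any eviction path has to rule out a zero-capacity buffer before calling it. A minimal sketch (`evictIndex` is a name invented here):

```go
package main

import (
	"fmt"
	"math/rand"
)

// evictIndex picks a random slot to overwrite. rand.Intn panics when
// n <= 0, so the zero-capacity case must be handled before calling it;
// this guard is what the BufferSizeZero test probes for.
func evictIndex(bufLen int) (int, bool) {
	if bufLen <= 0 {
		return 0, false // nothing to evict; the caller drops the metric
	}
	return rand.Intn(bufLen), true
}

func main() {
	if _, ok := evictIndex(0); !ok {
		fmt.Println("buffer-size 0: metric dropped, no panic")
	}
}
```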
Quantization
Gemma 4 26B: quant doesn’t matter.
| Quant | Size | Exam v1 | Exam v2 | Compiles |
|---|---|---|---|---|
| UD-Q4_K_M | 16 GB | 14.0 (14–14) | 4.0 | 2/3 |
| MXFP4_MOE | 15 GB | 14.0 (14–14) | 4.0 | 2/3 |
| UD-Q5_K_M | 21 GB | 14.0 (14–14) | 4.0 | 2/3 |
Identical across all three. Pick the smallest: MXFP4 at 15 GB.
Qwen3.5-35B: Q5_K_M is the most reliable.
| Quant | Size | Exam v1 | Exam v2 | Compiles |
|---|---|---|---|---|
| Q4_K_M | 21 GB | 14.0 (14–14) | 2.3 | 1/3 |
| Q5_K_M | 28 GB | 12.0 (9–14) | 3.3 | 2/3 |
| Q6_K | 32 GB | 14.3 (14–15) | 2.7 | 1/3 |
| MXFP4_MOE | 20 GB | 14.0 (14–14) | 2.0 | 1/3 |
Q5_K_M has the best compile rate on the hard exam (2/3 vs 1/3) but wobbles on the easy one (9–14). Adding bits doesn't help monotonically: Q6_K carries more precision than Q5_K_M yet compiles less often. At 28 GB, Qwen3.5 still can't match Gemma 4's consistency.
We don’t fully understand why Qwen3.5 is so flaky under quantization. On paper it should be the stronger model — it leads on Terminal-Bench 2, SWE-bench, and TAU2 at full precision. But quantized and running locally, it compiles less often than Gemma 4 across every quant we tried. Our best guess: Qwen3.5’s hybrid architecture (Gated DeltaNet + MoE with 256 tiny experts) may be more sensitive to weight precision loss than Gemma 4’s 128-expert MoE. But we haven’t proven that — it’s speculation. If you know more about this, we’d like to hear it.
What we got wrong and fixed
Context truncation. With 8k context, models were writing correct code that got cut off mid-function. Looked like model quality problems. It was infrastructure. Bumped to 16k, half the compile failures disappeared.
Grep scoring. The original exam_v2 eval checked for keywords in the source code. Models scored 15–18/20. Under real execution with integration tests: 0–8/10. We wasted a round of runs on this before rebuilding the eval as a proper Go test harness.
Single-seed noise. Every major conclusion from our initial n=1 runs was wrong or misleading. One model looked terrible on seed 42, great on seed 123. Three seeds is the minimum — we should probably use five, but GPU hours aren’t free on an iGPU.
External context
No published benchmarks cover quantized models in this size class; every leaderboard number is full precision on server GPUs. For reference, self-reported full-precision scores from the model authors:
| Benchmark | Qwen3.5-35B | Gemma 4 26B |
|---|---|---|
| Terminal-Bench 2 | 40.5% | — |
| SWE-bench Verified | 69.2% | — |
| TAU2 | 81.2 | 68.2 |
| LiveCodeBench | — | 77.1% |
| Arena AI (Elo) | 1400 | 1441 |
Qwen3.5 looks stronger on paper (TAU2, SWE-bench). Gemma 4 edges it on Arena AI. Our local quantized results tell a different story: Gemma 4 is more consistent where it counts — actually compiling and running under constrained context on real hardware.
Conclusions
Gemma 4 26B-A4B MXFP4 is the best local coding model on this hardware. 15 GB, 14/15 rock solid, 4.0/10 on the hard exam, quant-insensitive. 18 tok/s on an iGPU.
gpt-oss-20b is the best value. 12 GB, 27 tok/s, 14/15 rock solid, 3.7/10 on exam v2. Smallest and fastest model that competes with the bigger MoEs. Dense, no draft model needed.
Qwen3.5-35B is fast but flaky. 21–22 tok/s but lower compile rates on the hard exam. Q5_K_M (28 GB) is the best quant if you need it.
Compile rate matters more than peak score. A model that compiles 2/3 seeds and scores 4/10 is more useful than one that compiles 1/3 and scores 8 on a lucky run.
Multi-seed and real execution are non-negotiable. Grep scoring inflated results. Single-seed results were noise. Every major conclusion from the initial n=1 grep-based run was wrong.
Reproduce it
./sweep.sh
All code: exam_v1/, exam_v2/, sweep.sh. Results are local-only (regenerate with ./sweep.sh).
---
Co-authored with Claude Opus 4.6.