
Rebuilding the coding benchmark: Exam V2 $\rightarrow$ V3

May 2026 — co-authored with Gemma 4

The transition from Exam V2 to V3 was necessitated by the discovery that the V2 harness was providing invalid scoring data, masking model instability and quantization failures.

Setup

The Problem: V2 Harness Flaws

The Exam V2 harness relied on shell scripts and grep-based evaluation. This “vibecoded” approach introduced three critical failures:

  1. Denominator Inflation: When a model’s code panicked and crashed the `go test` suite, the harness only counted the tests that had run to completion. This allowed models that crashed early to achieve 100% scores (e.g., 2/2 instead of 2/10).
  2. Masked Instability: The loose scoring made it impossible to distinguish between a “bad” model and an “unstable” one. We saw high variance in Qwen and GPT-OSS across seeds that the harness failed to flag as high-uncertainty.
  3. Fragile Parsing: Using grep to check for behavioral markers was prone to false positives/negatives, especially when models hallucinated markers or failed to follow formatting.
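The denominator fix is straightforward once stated: the scorer must take the full expected-test list as the denominator and treat any test with no recorded result as a failure. A minimal Go sketch of that rule (the `score` function and test names are illustrative, not the actual V3 API):

```go
package main

import "fmt"

// score uses the full expected-test list as the denominator, so a run
// that crashes after two tests is scored 2/10, not 2/2. Any test with
// no recorded result is treated as a failure.
func score(expected []string, results map[string]bool) (passed, total int) {
	total = len(expected)
	for _, name := range expected {
		if results[name] { // missing entry => zero value false => failure
			passed++
		}
	}
	return passed, total
}

func main() {
	expected := []string{"t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"}
	// Simulated crash: only two tests produced results before the panic.
	results := map[string]bool{"t1": true, "t2": true}
	p, n := score(expected, results)
	fmt.Printf("%d/%d\n", p, n) // prints "2/10"
}
```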

Trigger for rebuild: High-tier models (Qwen 3.6) were producing lower scores than lower-tier models, indicating the measuring stick was broken.

The Solution: Exam V3

V3 is a complete rewrite of the evaluation logic, moving from shell-based orchestration to an in-process Go grader.

Results (Clean Rerun)

Note: All models were re-run to ensure a clean baseline on the new harness.

| Model | Best (Seed) | Avg (3 Seeds) | Notes |
|---|---|---|---|
| Gemma 4 26B (MXFP4 + KV:Q8) | 11/13 | 7.33 | Most stable; consistent across seeds. |
| GPT-OSS 20B (MXFP4 + KV:Q8) | 11/13 | 5.33 | High variance: 5 / 0 / 11. |
| Qwen 3.6 35B (Q5_K_M + KV:Q8) | 7/13 | 2.33 | Strongest Qwen; highly seed-dependent. |
| Qwen 3.5 35B (Q6_K + KV:Q8) | 6/13 | 2.00 | Significant performance drop. |
| Qwen 3.5 35B (MXFP4 + KV:Q8) | 0/13 | 0.00 | Consistent compile failure. |
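With per-seed scores recorded, instability can be flagged mechanically rather than eyeballed. A small illustrative sketch (`summarize` is not part of the harness) using the GPT-OSS per-seed scores from the table:

```go
package main

import "fmt"

// summarize reports the mean and the max-minus-min spread of per-seed
// scores. A large spread relative to the mean marks a model as
// high-uncertainty rather than simply "good" or "bad".
func summarize(scores []int) (mean float64, spread int) {
	min, max, sum := scores[0], scores[0], 0
	for _, s := range scores {
		sum += s
		if s < min {
			min = s
		}
		if s > max {
			max = s
		}
	}
	return float64(sum) / float64(len(scores)), max - min
}

func main() {
	// GPT-OSS 20B per-seed scores from the results table.
	mean, spread := summarize([]int{5, 0, 11})
	fmt.Printf("mean=%.2f spread=%d\n", mean, spread) // prints "mean=5.33 spread=11"
}
```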

Key Findings

What we got wrong

We assumed a shell-script wrapper was sufficient for complex behavioral testing. We prioritized ease of implementation over deterministic measurement, which led to the “denominator inflation” bug.

Next Steps

