Google 26B-A4B params BF16 + MTP MoE

Gemma-4-26B-A4B-it MTP

An interesting middle position on vLLM v0.23.0. MTP lifts chat to 17.79 t/s/user and multi-turn to 16.57 t/s/user without switching to the NVIDIA NVFP4 re-quant. The tail latencies are not the best, but decode is much faster than plain BF16.

Arena score: 70.1
Throughput: 81 tok/s
VRAM: 52 GB
Benches measured: 8/9
Hugging Face → vLLM v0.23.0 · DGX Spark, NVIDIA GB10, 128 GB unified memory · Last measured 23 June 2026

Quality is one component of the Arena score. These numbers were not measured by me; they come from the vendor's official model cards. For cross-model comparison with a consistent eval harness, Artificial Analysis is a useful third party. The average of the three benchmarks feeds one-to-one into the Score formula, weighted more heavily in Aggregate/Agent and more lightly in Batch.

Avg: 80.7
MMLU-Pro: 82.6
GPQA-Diamond: 82.3
HumanEval: 77.1
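
The Avg figure can be reproduced as a plain arithmetic mean of the three listed benchmarks. A minimal sketch in Python; the dictionary and variable names are illustrative, and how the average is weighted inside the Score formula is not shown on this page, so that part is left out.

```python
# Quality scores as listed in the vendor model card figures.
quality = {"MMLU-Pro": 82.6, "GPQA-Diamond": 82.3, "HumanEval": 77.1}

# Unweighted arithmetic mean, rounded to one decimal as displayed.
quality_avg = round(sum(quality.values()) / len(quality), 1)
print(quality_avg)
```
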
Decode throughput · total t/s · c=10
BF16 + MTP
1k ctx: 139 t/s
8k ctx: 97.0 t/s
4k+turn: 143 t/s
25k ctx: 28.0 t/s

The suite consists of 5 closed-loop tests with llama-benchy and 4 open-loop tests with vllm bench serve. Each benchmark reports tokens/sec (decode throughput) and TTFT p50. TTFT translates directly into how responsive the model feels; tokens/sec translates into capacity.
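
To make the split between TTFT and tokens/sec concrete, a rough per-request wall-clock estimate is TTFT plus the generated tokens divided by the per-user decode rate. The sketch below is an approximation that assumes decode speed stays constant over the whole response; the function name is illustrative, and the inputs are the Chat benchmark figures (TTFT p50 1.40 s, 17.8 t/s, 1024 generated tokens).

```python
def estimate_response_seconds(ttft_s: float, gen_tokens: int, decode_tps: float) -> float:
    """Approximate time for one response: wait for the first token, then stream the rest."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_s + gen_tokens / decode_tps

# Chat benchmark figures: TTFT p50 1.40 s, 1024 generated tokens, 17.8 t/s per user.
chat_seconds = estimate_response_seconds(ttft_s=1.40, gen_tokens=1024, decode_tps=17.8)
print(f"{chat_seconds:.1f} s")
```

Under these assumptions the wait before the first token is a small share of a 1024-token chat answer, while in the 25k-context test (TTFT p50 45.64 s, 256 tokens at 6.0 t/s) prefill and decode contribute roughly equally.
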

01 · llama-benchy closed-loop

Chat

Short prompt, long answer. The shape that should feel like normal chat; TTFT determines whether it feels "snappy".

pp (prompt): 1024 · tg (gen): 1024 · depth: 0 · concurrency: 10 · runs: 3
tokens/sec: 17.8 t/s
TTFT · p50: 1.40 s
3 runs · seed 42
02 · llama-benchy closed-loop

RAG · 8k context

Medium-sized context: a few document chunks with an answer of normal length. Shows prefill cost without hitting the wall.

pp (prompt): 8192 · tg (gen): 512 · depth: 0 · concurrency: 10 · runs: 3
tokens/sec: 13.2 t/s
TTFT · p50: 9.52 s
3 runs · seed 42
03 · llama-benchy closed-loop

Long output / agents

Short instruction, lots of output: code generation, reports, or structured agent output. A stress test for decode throughput.

pp (prompt): 256 · tg (gen): 4096 · depth: 0 · concurrency: 10 · runs: 3
tokens/sec: 17.7 t/s
TTFT · p50: 564 ms
3 runs · seed 42
04 · llama-benchy closed-loop

Large context · 25k

Stress test with large prompts. Not necessarily chat material, but exactly where the prefill wall becomes visible and TTFT collapses.

pp (prompt): 25000 · tg (gen): 256 · depth: 0 · concurrency: 10 · runs: 3
tokens/sec: 6.0 t/s
TTFT · p50: 45.64 s
3 runs · seed 42
05 · llama-benchy closed-loop

Multi-turn · office work

Five turns per conversation, ten conversations in parallel. Close to how a team actually uses this, with the context growing every turn.

pp (prompt): 2048 · tg (gen): 512 · depth: 4 · concurrency: 10 · runs: 3
tokens/sec: 16.6 t/s
TTFT · p50: 2.37 s
3 runs · seed 42
06 · vllm bench serve open-loop

Realistic office baseline

Random dataset · 4000 tokens in, 500 tokens out · request rate 0.3, burstiness 0.7. A quiet office.

dataset: random · rate (req/s): 0.30 · burstiness: 0.7 · prompts: 200
tokens/sec: 47.8 t/s
TTFT · p50: 1.61 s
200 prompts · seed 42
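
The request-rate and burstiness settings describe how arrivals are spaced in an open-loop run. As far as I understand vLLM's benchmark client, the gaps between requests are drawn from a gamma distribution whose shape is the burstiness value and whose mean is 1/request-rate, so burstiness 1.0 is a Poisson process and values below 1.0 cluster requests into bursts. The sketch below mimics that with the standard library only; it is an illustration under that assumption, not the benchmark's own code, and the function name is hypothetical.

```python
import random

def sample_arrival_gaps(request_rate: float, burstiness: float, n: int, seed: int = 42) -> list[float]:
    """Draw n inter-arrival gaps in seconds with mean 1 / request_rate."""
    rng = random.Random(seed)
    # Gamma mean is shape * scale, so this scale keeps the mean at 1 / request_rate.
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]

# Settings of the office baseline: 0.3 req/s, burstiness 0.7, 200 prompts.
gaps = sample_arrival_gaps(request_rate=0.3, burstiness=0.7, n=200)
mean_gap = sum(gaps) / len(gaps)
print(f"mean gap {mean_gap:.2f} s, longest gap {max(gaps):.2f} s")
```

At 0.3 req/s the expected gap is about 3.3 s; a burstiness below 1.0 keeps that average but makes short gaps and long pauses both more common.
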
07 · vllm bench serve open-loop

Real conversations · ShareGPT

ShareGPT V3 · 228 tokens per turn on average · varies naturally per conversation. What real users do, not a synthetic random distribution.

dataset: sharegpt v3 · rate (req/s): 0.30 · burstiness: 0.7 · prompts: 250
tokens/sec: 11.1 t/s
TTFT · p50: 409 ms
250 prompts · seed 42
08 · vllm bench serve open-loop

Monday morning peak

Random · 4000 in / 500 out · request rate 1.5 req/s, burstiness 1.0, max 25 parallel. When everyone logs in at once, do we see the queue grow?

dataset: random · rate (req/s): 1.50 · burstiness: 1.0 · prompts: 300 · max parallel: 25
tokens/sec: 53.5 t/s
TTFT · p50: 1.68 s
300 prompts · seed 42
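
The exact command is not reproduced in this text, so the following is only a sketch of how the Monday peak settings could map onto `vllm bench serve`. The flag names are the ones I know from that CLI; the model identifier is a placeholder, the helper name is hypothetical, and server address flags are omitted.

```python
import shlex
from typing import Optional

def build_bench_command(model: str, input_len: int, output_len: int,
                        request_rate: float, burstiness: float,
                        num_prompts: int, seed: int = 42,
                        max_concurrency: Optional[int] = None) -> list[str]:
    """Assemble argv for an open-loop `vllm bench serve` run on the random dataset."""
    cmd = [
        "vllm", "bench", "serve",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--request-rate", str(request_rate),
        "--burstiness", str(burstiness),
        "--num-prompts", str(num_prompts),
        "--seed", str(seed),
    ]
    if max_concurrency is not None:
        # Caps the number of in-flight requests ("max parallel" in the settings).
        cmd += ["--max-concurrency", str(max_concurrency)]
    return cmd

# "MODEL_ID" is a placeholder, not the identifier used for these runs.
peak_cmd = build_bench_command("MODEL_ID", 4000, 500, 1.5, 1.0, 300, max_concurrency=25)
print(shlex.join(peak_cmd))
```

Building the argument list in code rather than as one string keeps the per-scenario settings (rate, burstiness, prompt count) easy to vary while the rest of the command stays fixed.
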
09 · vllm bench serve open-loop

Reasoning workload

Long chain-of-thought outputs · 1k in / 4k out · slow rate (0.2 req/s) because every request consumes a lot of decode budget. Tests whether TTFT stays stable.

dataset: random · rate (req/s): 0.20 · burstiness: 1.0 · prompts: 50
tokens/sec: not reported
TTFT · p50: not reported
50 prompts · seed 42
What works

NVFP4 is the practical default

Chat at 21.59 t/s/user and multi-turn at 20.01 t/s/user at c=10. For local office chat this does not feel like a compromise.

What broke

25k context is still prefill pain

Even NVFP4 sits at an average TTFT of 38.58 s for 25k context at c=10. The serving profile helps decode, not the wait before large prompts.

What disappointed

MTP buys decode, not perfect tail

MTP beats BF16 on decode, but under Monday peak load its p95 TTFT and p95 TPOT are worse than BF16. Percentiles still matter.
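
Why a run can look fine on the median yet lose on p95: a handful of slow requests barely moves the median but defines the tail. A minimal sketch using nearest-rank percentiles; the latency samples are made up for illustration and are not measurements from this benchmark, and the function name is hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic TTFT samples in seconds (illustrative only, not measured data).
steady = [1.6] * 20
spiky = [1.2] * 18 + [6.0, 9.0]

for name, run in (("steady", steady), ("spiky", spiky)):
    print(name, "p50:", percentile(run, 50), "p95:", percentile(run, 95))
```

In this toy data the spiky run has the better median but a p95 several times worse, which is the pattern the Monday peak comparison describes.
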

What surprised

ShareGPT replay is extremely friendly

NVFP4 completes 250/250 requests with p95 TTFT 225.09 ms and p95 TPOT 45.30 ms. Real short conversations are much lighter than random 4k.
